Yoshifumi Kawasaki


2023

Variance Matters: Detecting Semantic Differences without Corpus/Word Alignment
Ryo Nagata | Hiroya Takamura | Naoki Otani | Yoshifumi Kawasaki
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In this paper, we propose methods for discovering semantic differences in words appearing in two corpora. The key idea is to measure the coverage of meanings of a word in a corpus through the norm of its mean word vector, which is equivalent to examining a kind of variance of the word vector distribution. Unlike previous methods, the proposed methods do not require alignments between words and/or corpora for comparison. All they require is computing the variance (or the norm of the mean word vector) for each word type. Nevertheless, they rival the best-performing system in SemEval-2020 Task 1. In addition, they are (i) robust to skew in corpus sizes; (ii) capable of detecting semantic differences in infrequent words; and (iii) effective in pinpointing word instances that have a meaning missing in one of the two corpora under comparison. We show these advantages for historical corpora and also for native/non-native English corpora.
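
The link between the norm of a mean vector and a kind of variance can be illustrated with a minimal sketch. It assumes that each word instance is represented by a unit-normalized vector (an assumption made here for illustration, not a detail stated in the abstract); under that assumption, the mean squared distance of the vectors from their mean equals one minus the squared norm of the mean. The function names and the toy vectors are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    length = norm(v)
    return [x / length for x in v]


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def mean_vector_norm(vectors):
    """Norm of the mean of the unit-normalized vectors (between 0 and 1)."""
    return norm(mean_vector([normalize(v) for v in vectors]))


def spread(vectors):
    """Mean squared distance of the unit-normalized vectors from their mean."""
    unit = [normalize(v) for v in vectors]
    mu = mean_vector(unit)
    return sum(
        sum((x - m) ** 2 for x, m in zip(v, mu)) for v in unit
    ) / len(unit)


# Made-up instance vectors of one word in two hypothetical corpora.
narrow = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.1]]  # similar usages
broad = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]  # diverse usages

for name, vectors in (("narrow", narrow), ("broad", broad)):
    print(name, mean_vector_norm(vectors), spread(vectors))
```

In this toy setting, the word with more diverse usages has a shorter mean vector and a larger spread, which is the intuition behind comparing norms across corpora.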

Revisiting Authorship Attribution of Tirant lo Blanc Using Parts of Speech n-grams
Yoshifumi Kawasaki
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Tirant lo Blanc (TLB) is a masterpiece of medieval Catalan chivalric romance. Regarding its authorship, two hypotheses exist: the single-authorship hypothesis claims in agreement with the dedication that Joanot Martorell is the sole author, whereas the dual-authorship hypothesis alleges in line with the colophon that Martorell wrote the first three parts and Martí Joan de Galba added the fourth part. In this study, we revisit the unsettled authorship attribution of TLB with stylometric techniques; specifically, we exploit parts-of-speech (POS) n-grams as stylistic features to investigate stylistic differences (if any) across the work. Furthermore, we address the distinction between narration and conversation, which has previously been omitted. We performed exploratory multivariate analyses and demonstrated that, despite internal differences, single-authorship is more likely from a statistical point of view. If Galba had contributed something to the last quarter of the work, it would have been minimal.
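
As a rough illustration of POS n-grams as stylistic features, the sketch below turns a sequence of POS tags into relative n-gram frequencies. The tag names, the sample sequence, and the function names are hypothetical; the abstract does not specify the tagset, the value of n, or the normalization.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return all contiguous n-grams of a POS tag sequence as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def relative_frequencies(tags, n):
    """Map each POS n-gram to its share of all n-grams in the sequence."""
    counts = Counter(pos_ngrams(tags, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical tag sequence for a short text segment.
sample_tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
features = relative_frequencies(sample_tags, 2)
print(features)
```

Feature vectors of this kind, computed per text segment, could then serve as input to an exploratory multivariate analysis.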

2022

A Stylometric Analysis of Amadís de Gaula and Sergas de Esplandián
Yoshifumi Kawasaki
Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities

Amadís de Gaula (AG) and its sequel Sergas de Esplandián (SE) are masterpieces of medieval Spanish chivalric romance. Much debate has been devoted to the role played by their purported author Garci Rodríguez de Montalvo. According to the prologue of AG, which consists of four books, the author allegedly revised the first three books that were in circulation at that time and added the fourth book and SE. However, the extent to which Montalvo edited the materials at hand to compose the extant works has yet to be explored extensively. To address this question, we applied stylometric techniques for the first time. Specifically, we investigated the stylistic differences (if any) between the first three books of AG and his own extensions. Literary style is represented as the usage of parts-of-speech n-grams. We performed principal component analysis and k-means clustering to demonstrate that Montalvo’s retouching of the first book was minimal, whereas he revised the second and third books in such a way that they came to moderately resemble his authentic creation, that is, the fourth book and SE. Our findings empirically corroborate suppositions formulated from philological viewpoints.
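
The clustering step can be sketched as follows. This is a plain k-means loop over toy two-dimensional points, not the study's actual pipeline: the principal component analysis is omitted, the initial centroids are passed in explicitly for reproducibility, and all names and data are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, iterations=10):
    """Run a fixed number of k-means iterations; return labels and centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy points standing in for text segments projected onto two components.
points = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
labels, centroids = kmeans(points, initial_centroids=[points[0], points[2]])
print(labels, centroids)
```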

Revisiting Statistical Laws of Semantic Shift in Romance Cognates
Yoshifumi Kawasaki | Maëlys Salingre | Marzena Karpinska | Hiroya Takamura | Ryo Nagata
Proceedings of the 29th International Conference on Computational Linguistics

This article revisits statistical relationships across Romance cognates between lexical semantic shift and six intra-linguistic variables, such as frequency and polysemy. Cognates are words that are derived from a common etymon, in this case, a Latin ancestor. Despite their shared etymology, some cognate pairs have experienced semantic shift. The degree of semantic shift is quantified using cosine distance between the cognates’ corresponding word embeddings. In the previous literature, frequency and polysemy have been reported to be correlated with semantic shift; however, the understanding of their effects needs revision because of various methodological defects. In the present study, we perform regression analysis under improved experimental conditions, and demonstrate a genuine negative effect of frequency and positive effect of polysemy on semantic shift. Furthermore, we reveal that morphologically complex etyma are more resistant to semantic shift and that the cognates that have been in use over a longer timespan are prone to greater shift in meaning. These findings add to our understanding of the historical process of semantic change.
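
The cosine distance used to quantify semantic shift can be written out in a few lines. The sketch assumes that the two cognates' embeddings already live in a comparable vector space; how that is achieved is not described in the abstract, and the vectors below are made up.

```python
import math


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Made-up embeddings for a hypothetical cognate pair.
cognate_a = [0.8, 0.1, 0.3]
cognate_b = [0.2, 0.9, 0.1]
print(cosine_distance(cognate_a, cognate_b))
```

A distance near 0 indicates that the two embeddings point in nearly the same direction, and larger values indicate a greater difference in usage.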

2018

A POS Tagging Model Adapted to Learner English
Ryo Nagata | Tomoya Mizumoto | Yuta Kikuchi | Yoshifumi Kawasaki | Kotaro Funakoshi
Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text

There has been very limited work on the adaptation of Part-Of-Speech (POS) tagging to learner English despite the fact that POS tagging is widely used in related tasks. In this paper, we explore how we can adapt POS tagging to learner English efficiently and effectively. Based on the discussion of possible causes of POS tagging errors in learner English, we show that deep neural models are particularly suitable for this. Considering the previous findings and the discussion, we introduce the design of our model based on bidirectional Long Short-Term Memory. In addition, we describe how to adapt it to a wide variety of native languages (potentially, hundreds of them). In the evaluation section, we empirically show that it is effective for POS tagging in learner English, achieving an accuracy of 0.964, which significantly outperforms the state-of-the-art POS-tagger. We further investigate the tagging results in detail, revealing which part of the model design does or does not improve the performance.

2017

Analyzing Semantic Change in Japanese Loanwords
Hiroya Takamura | Ryo Nagata | Yoshifumi Kawasaki
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We analyze semantic changes in loanwords from English that are used in Japanese (Japanese loanwords). Specifically, we create word embeddings of English and Japanese and map the Japanese embeddings into the English space so that we can calculate the similarity of each Japanese word and each English word. We then attempt to find loanwords that are semantically different from their original, see if known meaning changes are correctly captured, and show the possibility of using our methodology in language education.

2016

Discriminative Analysis of Linguistic Features for Typological Study
Hiroya Takamura | Ryo Nagata | Yoshifumi Kawasaki
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We address the task of automatically estimating the missing values of linguistic features by making use of the fact that some linguistic features in typological databases are informative about each other. The questions to address in this work are (i) how much predictive power do features have on the value of another feature? (ii) to what extent can we attribute this predictive power to genealogical or areal factors, as opposed to being provided by tendencies or implicational universals? To address these questions, we conduct a discriminative or predictive analysis on the typological database. Specifically, we use a machine-learning classifier to estimate the value of each feature of each language using the values of the other features, under different choices of training data: all the other languages, or all the other languages except for the ones having the same origin or area as the target language.
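
The two choices of training data can be sketched as a simple filter. The record layout (`family`, `area`), the placeholder language names, and the function name are assumptions for illustration; the abstract does not describe the database format or the classifier.

```python
def training_languages(languages, target, exclude_related=False):
    """Select the languages used to train a predictor for the target language.

    languages: dict mapping a language name to a dict with "family" and "area".
    exclude_related: if True, also drop languages that share the target's
    family (origin) or area.
    """
    target_info = languages[target]
    selected = []
    for name, info in languages.items():
        if name == target:
            continue  # never train on the language being predicted
        related = (info["family"] == target_info["family"]
                   or info["area"] == target_info["area"])
        if exclude_related and related:
            continue
        selected.append(name)
    return selected


# Placeholder records; the names and values are hypothetical.
languages = {
    "lang_a": {"family": "F1", "area": "R1"},
    "lang_b": {"family": "F1", "area": "R2"},
    "lang_c": {"family": "F2", "area": "R1"},
    "lang_d": {"family": "F3", "area": "R3"},
}

print(training_languages(languages, "lang_a"))
print(training_languages(languages, "lang_a", exclude_related=True))
```

Comparing a classifier's accuracy under the two settings gives a rough sense of how much predictive power comes from genealogical or areal relatedness.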