Masayuki Asahara

2024

Long Unit Word Tokenization and Bunsetsu Segmentation of Historical Japanese
Hiroaki Ozaki | Kanako Komiya | Masayuki Asahara | Toshinobu Ogiso
Proceedings of the 1st Workshop on Machine Learning for Ancient Languages (ML4AL 2024)

In Japanese, the natural minimal phrase of a sentence is the “bunsetsu”, which serves native speakers as a more natural boundary within a sentence than the word does; grammatical analysis in Japanese linguistics therefore commonly operates on bunsetsu units. In contrast, because Japanese does not have delimiters between words, there are two major categories of word definition, namely, Short Unit Words (SUWs) and Long Unit Words (LUWs). Though a dictionary is available for SUWs, none is available for LUWs. Hence, this study focuses on providing a deep learning-based (or LLM-based) bunsetsu and LUW analyzer for the Heian period (AD 794-1185) and evaluating its performance. We model the parser as a transformer-based joint sequence labeling model, which combines a bunsetsu BI tag, an LUW BI tag, and an LUW Part-of-Speech (POS) tag for each SUW token. We train our models on corpora of each period, including contemporary and historical Japanese. The results range from 0.976 to 0.996 in F1 value for both bunsetsu and LUW reconstruction, indicating that our models achieve performance comparable to that of models for a contemporary Japanese corpus. The statistical analysis and diachronic case study suggest that the estimation of bunsetsu could be influenced by the grammaticalization of morphemes.
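The joint labeling scheme described in this abstract can be illustrated with a small decoding sketch. This is not the authors' implementation: the `bunsetsu|LUW|POS` label string format, the function name `decode_joint_labels`, the POS names, and the contemporary example phrase with its segmentation are all assumptions made for illustration.

```python
def decode_joint_labels(suws, labels):
    """Rebuild bunsetsu and LUWs from per-SUW joint labels.

    Each label is a hypothetical "bunsetsu_tag|luw_tag|luw_pos" string,
    where the two tags are "B" (begins a unit) or "I" (continues it).
    """
    bunsetsu = []  # list of surface strings
    luws = []      # list of (surface, pos) pairs
    for suw, label in zip(suws, labels):
        b_tag, l_tag, pos = label.split("|")
        # Start a new bunsetsu on "B" (or if nothing has been opened yet).
        if b_tag == "B" or not bunsetsu:
            bunsetsu.append(suw)
        else:
            bunsetsu[-1] += suw
        # Same rule for LUWs; the POS of the first SUW labels the whole LUW.
        if l_tag == "B" or not luws:
            luws.append((suw, pos))
        else:
            surface, first_pos = luws[-1]
            luws[-1] = (surface + suw, first_pos)
    return bunsetsu, luws


# Illustrative (assumed) segmentation of a contemporary phrase.
suws = ["国立", "国語", "研究", "所", "で", "働く"]
labels = ["B|B|NOUN", "I|I|NOUN", "I|I|NOUN", "I|I|NOUN", "I|B|ADP", "B|B|VERB"]
bunsetsu, luws = decode_joint_labels(suws, labels)
print(bunsetsu)
print(luws)
```

Packing the three decisions into one label per SUW means a single tagger pass yields both segmentations and the LUW POS at once; the decoder only has to group consecutive tokens.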

Collection of Japanese Route Information Reference Expressions Using Maps as Stimuli
Yoshiko Kawabata | Mai Omura | Hikari Konishi | Masayuki Asahara | Johane Takeuchi
Proceedings of the 4th Workshop on Spatial Language Understanding and Grounded Communication for Robotics (SpLU-RoboNLP 2024)

We constructed a database of Japanese expressions based on route information. Using 20 maps as stimuli, we requested descriptions of routes between two points on each map from 40 individuals per route, collecting 1600 route information reference expressions. We determined whether the expressions were based solely on relative reference expressions by using landmarks on the maps. In cases in which only relative reference expressions were used, we labeled the presence or absence of information regarding the starting point, waypoints, and destination. Additionally, we collected clarity ratings for each expression using a survey.

Prior Knowledge-Guided Adversarial Training
Lis Pereira | Fei Cheng | Wan Jou She | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)

We introduce a simple yet effective Prior Knowledge-Guided ADVersarial Training (PKG-ADV) algorithm to improve adversarial training for natural language understanding. Our method simply utilizes task-specific label distribution to guide the training process. By prioritizing the use of prior knowledge of labels, we aim to generate more informative adversarial perturbations. We apply our model to several challenging temporal reasoning tasks. Our method enables a more reliable and controllable data training process than relying on randomized adversarial perturbation. Albeit simple, our method achieved significant improvements in these tasks. To facilitate further research, we will release the code and models.

2023

UD_Japanese-CEJC: Dependency Relation Annotation on Corpus of Everyday Japanese Conversation
Mai Omura | Hiroshi Matsuda | Masayuki Asahara | Aya Wakasa
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

In this study, we have developed Universal Dependencies (UD) resources for spoken Japanese in the Corpus of Everyday Japanese Conversation (CEJC). The CEJC is a large corpus of spoken language that encompasses various everyday conversations in Japanese, and includes word delimitation and part-of-speech annotation. We have newly annotated Long Word Unit delimitation and Bunsetsu (Japanese phrase)-based dependencies, including Bunsetsu boundaries, for CEJC. The UD of Japanese resources was constructed in accordance with hand-maintained conversion rules from the CEJC with two types of word delimitation, part-of-speech tags and Bunsetsu-based syntactic dependency relations. Furthermore, we examined various issues pertaining to the construction of UD in the CEJC by comparing it with the written Japanese corpus and evaluating UD parsing accuracy.

Word Familiarity Rate Estimation for Japanese Functional Words Using a Bayesian Linear Mixed Model
Bocheng Chen | Masayuki Asahara
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Spatial Information Annotation Based on the Double Cross Model
Yoshiko Kawabata | Mai Omura | Masayuki Asahara | Johane Takeuchi
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

All-Words Word Sense Disambiguation for Historical Japanese
Soma Asada | Kanako Komiya | Masayuki Asahara
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

2022

Word Sense Disambiguation of Corpus of Historical Japanese Using Japanese BERT Trained with Contemporary Texts
Kanako Komiya | Nagi Oki | Masayuki Asahara
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

CHJ-WLSP: Annotation of ‘Word List by Semantic Principles’ Labels for the Corpus of Historical Japanese
Masayuki Asahara | Nao Ikegami | Tai Suzuki | Taro Ichimura | Asuko Kondo | Sachi Kato | Makoto Yamazaki
Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages

This article presents a word-sense annotation for the Corpus of Historical Japanese: a mashed-up Japanese lexicon based on the ‘Word List by Semantic Principles’ (WLSP). The WLSP is a large-scale Japanese thesaurus that includes 98,241 entries with syntactic and hierarchical semantic categories. The historical WLSP is also compiled for the words in ancient Japanese. We utilized a morpheme-word sense alignment table to extract all possible word sense candidates for each word appearing in the target corpus. Then, we manually disambiguated the word senses for 647,751 words in the texts from the 10th century to 1910.

Reading Time and Vocabulary Rating in the Japanese Language: Large-Scale Japanese Reading Time Data Collection Using Crowdsourcing
Masayuki Asahara
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This study examines how differences in human vocabulary affect reading time. Specifically, we modeled vocabulary as the random effect of research participants when applying a generalized linear mixed model to the participants' ratings in the word familiarity survey. Thereafter, we asked the participants to take part in a self-paced reading task to collect their reading times. By including vocabulary as a fixed effect in a generalized linear mixed model of reading time, we clarified how vocabulary differences tend to affect reading time.

2021

Lower Perplexity is Not Always Human-Like
Tatsuki Kuribayashi | Yohei Oseki | Takumi Ito | Ryo Yoshida | Masayuki Asahara | Kentaro Inui
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universal within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine an established generalization —the lower perplexity a language model has, the more human-like the language model is— in Japanese with typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models.
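For reference, perplexity is the exponential of the average negative log-probability a model assigns to the tokens of a text, and per-token surprisal is the quantity usually compared with reading behavior. A minimal sketch, assuming the per-token probabilities are already available; the function names and the numbers below are made up for illustration and are not results from the paper.

```python
import math

def surprisals(token_probs):
    """Per-token surprisal in nats: -log p(token | context)."""
    return [-math.log(p) for p in token_probs]

def perplexity(token_probs):
    """Exponential of the mean surprisal over the sequence."""
    s = surprisals(token_probs)
    return math.exp(sum(s) / len(s))

# Hypothetical probabilities from two language models on the same 4 tokens.
model_a = [0.5, 0.25, 0.5, 0.25]
model_b = [0.25, 0.125, 0.25, 0.125]
print(perplexity(model_a), perplexity(model_b))
```

Here the first model has the lower perplexity; the abstract's point is that such a ranking need not match how well each model's surprisals fit human reading data.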

Dependency Enhanced Contextual Representations for Japanese Temporal Relation Classification
Chenjing Geng | Fei Cheng | Masayuki Asahara | Lis Kanashiro Pereira | Ichiro Kobayashi
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

ALICE++: Adversarial Training for Robust and Effective Temporal Reasoning
Lis Pereira | Fei Cheng | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

The Annotation of Antonym Information in the ‘Word List by Semantic Principles’
Sachi Kato | Masayuki Asahara | Nanami Moriyama | Makoto Yamazaki | Asami Ogiwara
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Word Delimitation Issues in UD Japanese
Mai Omura | Aya Wakasa | Masayuki Asahara
Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021)

2020

Adversarial Training for Commonsense Inference
Lis Pereira | Xiaodong Liu | Fei Cheng | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 5th Workshop on Representation Learning for NLP

We apply small perturbations to word embeddings and minimize the resultant adversarial risk to regularize the model. We exploit a novel combination of two different approaches to estimate these perturbations: 1) using the true label and 2) using the model prediction. Without relying on any human-crafted features, knowledge bases, or additional datasets other than the target datasets, our model boosts the fine-tuning performance of RoBERTa, achieving competitive results on multiple reading comprehension datasets that require commonsense inference.
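The idea of estimating embedding perturbations from either the true label or the model prediction can be sketched on a toy logistic-regression classifier. This is a simplified stand-in, not the paper's RoBERTa setup: the classifier, its weights, the input vector, the step size `epsilon`, the use of a hard prediction, and all function names are assumptions for illustration.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy for a single example."""
    p = predict(w, b, x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def perturb(w, b, x, target, epsilon):
    """Move x by epsilon along the normalized gradient of the loss w.r.t. x.

    For logistic regression that gradient is (p - target) * w.
    """
    p = predict(w, b, x)
    grad = [(p - target) * wi for wi in w]
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0
    return [xi + epsilon * g / norm for xi, g in zip(x, grad)]

# Toy "embedding" and classifier parameters (made-up values).
w, b = [0.8, -0.5, 0.3], 0.1
x, y_true = [1.0, 0.2, -0.4], 1
epsilon = 0.1

# 1) Perturbation estimated with the true label.
x_adv_label = perturb(w, b, x, y_true, epsilon)
# 2) Perturbation estimated with the model's own (hard) prediction.
y_pred = 1 if predict(w, b, x) >= 0.5 else 0
x_adv_pred = perturb(w, b, x, y_pred, epsilon)

clean = loss(w, b, x, y_true)
adv = loss(w, b, x_adv_label, y_true)
# Training would minimize the clean loss plus the adversarial loss(es).
print(clean, adv)
```

The second variant needs no gold label, so in principle it can be computed wherever the model can make a prediction.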

Generation and Evaluation of Concept Embeddings Via Fine-Tuning Using Automatically Tagged Corpus
Kanako Komiya | Daiki Yaginuma | Masayuki Asahara | Hiroyuki Shinnou
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Composing Word Vectors for Japanese Compound Words Using Bilingual Word Embeddings
Teruo Hirabayashi | Kanako Komiya | Masayuki Asahara | Hiroyuki Shinnou
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

Automatic Creation of Correspondence Table of Meaning Tags from Two Dictionaries in One Language Using Bilingual Word Embedding
Teruo Hirabayashi | Kanako Komiya | Masayuki Asahara | Hiroyuki Shinnou
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

In this paper, we show how to use bilingual word embeddings (BWE) to automatically create a correspondence table of meaning tags from two dictionaries in one language and examine the effectiveness of the method. In doing so, we faced a problem: the meaning tags do not always correspond one-to-one because the granularities of the word senses and the concepts are different from each other. Therefore, we regarded the concept tag that corresponds to a word sense the most as the correct concept tag corresponding to the word sense. We used two BWE methods, a linear transformation matrix and VecMap. We evaluated the most frequent sense (MFS) method and the corpus concatenation method for comparison. The accuracies of the proposed methods were higher than the accuracy of the random baseline but lower than those of the MFS and corpus concatenation methods. However, because our method utilized the embedding vectors of the word senses, the relations of the sense tags corresponding to concept tags could be examined by mapping the sense embeddings to the vector space of the concept tags. Also, our methods could be performed when we have only concept or word sense embeddings, whereas the MFS method requires a parallel corpus and the corpus concatenation method needs two tagged corpora.

Design of BCCWJ-EEG: Balanced Corpus with Human Electroencephalography
Yohei Oseki | Masayuki Asahara
Proceedings of the Twelfth Language Resources and Evaluation Conference

The past decade has witnessed the happy marriage between natural language processing (NLP) and the cognitive science of language. Moreover, given the historical relationship between biological and artificial neural networks, the advent of deep learning has re-sparked strong interests in the fusion of NLP and the neuroscience of language. Importantly, this inter-fertilization between NLP, on one hand, and the cognitive (neuro)science of language, on the other, has been driven by the language resources annotated with human language processing data. However, there remain several limitations with those language resources on annotations, genres, languages, etc. In this paper, we describe the design of a novel language resource called BCCWJ-EEG, the Balanced Corpus of Contemporary Written Japanese (BCCWJ) experimentally annotated with human electroencephalography (EEG). Specifically, after extensively reviewing the language resources currently available in the literature with special focus on eye-tracking and EEG, we summarize the details concerning (i) participants, (ii) stimuli, (iii) procedure, (iv) data preprocessing, (v) corpus evaluation, (vi) resource release, and (vii) compilation schedule. In addition, potential applications of BCCWJ-EEG to neuroscience and NLP will also be discussed.

The National Institute for Japanese Language and Linguistics, Japan (NINJAL, Japan), has developed several types of corpora. For each corpus NINJAL provided an online search environment, ‘Chunagon’, which is a morphological-information-annotation-based concordance system made publicly available in 2011. NINJAL has now provided a skewer-search system ‘Kotonoha’ based on the ‘Chunagon’ systems. This system enables querying of multiple corpora by certain categories, such as register type and period.

Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning
Fei Cheng | Masayuki Asahara | Ichiro Kobayashi | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2020

Temporal relation classification is the pair-wise task of identifying the relation of a temporal link (TLINK) between two mentions, i.e. event, time and document creation time (DCT). This pair-wise formulation leads to two crucial limitations: 1) Two TLINKs involving a common mention do not share information. 2) Existing models with independent classifiers for each TLINK category (E2E, E2T and E2D) cannot make use of the whole data. This paper presents an event-centric model that manages dynamic event representations across multiple TLINKs. Our model deals with three TLINK categories with multi-task learning to leverage the full size of data. The experimental results show that our proposal outperforms state-of-the-art models and two strong transfer learning baselines on both the English and Japanese data.

2019

Word Familiarity Rate Estimation Using a Bayesian Linear Mixed Model
Masayuki Asahara
Proceedings of the First Workshop on Aggregating and Analysing Crowdsourced Annotations for NLP

This paper presents research on word familiarity rate estimation using the ‘Word List by Semantic Principles’. We collected rating information on 96,557 words in the ‘Word List by Semantic Principles’ via Yahoo! crowdsourcing. We asked 3,392 subject participants to use their introspection to rate the familiarity of words based on the five perspectives of ‘KNOW’, ‘WRITE’, ‘READ’, ‘SPEAK’, and ‘LISTEN’, and each word was rated by at least 16 subject participants. We used Bayesian linear mixed models to estimate the word familiarity rates. We also explored the ratings with the semantic labels used in the ‘Word List by Semantic Principles’.

2018

Between Reading Time and Clause Boundaries in Japanese - Wrap-up Effect in a Head-Final Language
Masayuki Asahara
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Annotation of ‘Word List by Semantic Principles’ Labels for the Balanced Corpus of Contemporary Written Japanese
Sachi Kato | Masayuki Asahara | Makoto Yamazaki
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

All-words Word Sense Disambiguation Using Concept Embeddings
Rui Suzuki | Kanako Komiya | Masayuki Asahara | Minoru Sasaki | Hiroyuki Shinnou
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Predicting Japanese Word Order in Double Object Constructions
Masayuki Asahara | Satoshi Nambu | Shin-Ichiro Sano
Proceedings of the Eight Workshop on Cognitive Aspects of Computational Language Learning and Processing

This paper presents a statistical model to predict Japanese word order in double object constructions. We employed a Bayesian linear mixed model with manually annotated predicate-argument structure data. The findings from the refined corpus analysis confirmed the effects of the information status of an NP as ‘given-new ordering’ in addition to the effects of ‘long-before-short’ as a tendency of the general Japanese word order.

This paper discusses the representation of coordinate structures in the Universal Dependencies framework for two head-final languages, Japanese and Korean. UD applies a strict principle that makes the head of coordination the left-most conjunct. However, the guideline may produce syntactic trees which are difficult to accept in head-final languages. This paper describes the status in the current Japanese and Korean corpora and proposes alternative designs suitable for these languages.

UD-Japanese BCCWJ: Universal Dependencies Annotation for the Balanced Corpus of Contemporary Written Japanese
Mai Omura | Masayuki Asahara
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

In this paper, we describe a corpus UD Japanese-BCCWJ that was created by converting the Balanced Corpus of Contemporary Written Japanese (BCCWJ), a Japanese language corpus, to adhere to the UD annotation schema. The BCCWJ already assigns dependency information at the level of the bunsetsu (a Japanese syntactic unit comparable to the phrase). We developed a program to convert the BCCWJ to UD based on this dependency structure, and this corpus is the result of completely automatic conversion using the program. UD Japanese-BCCWJ is the largest-scale UD Japanese corpus and the second-largest of all UD corpora, including 1,980 documents, 57,109 sentences, and 1,273k words across six distinct domains.

2017

Between Reading Time and Information Structure
Masayuki Asahara
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

Between Reading Time and Syntactic/Semantic Categories
Masayuki Asahara | Sachi Kato
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This article presents a contrastive analysis between reading time and syntactic/semantic categories in Japanese. We overlaid the reading time annotation of BCCWJ-EyeTrack and a syntactic/semantic category information annotation on the ‘Balanced Corpus of Contemporary Written Japanese’. Statistical analysis based on a mixed linear model showed that verbal phrases tend to have shorter reading times than adjectives, adverbial phrases, or nominal phrases. The results suggest that the preceding phrases associated with the presenting phrases promote the reading process to shorten the gazing time.

2016

BCCWJ-DepPara: A Syntactic Annotation Treebank on the ‘Balanced Corpus of Contemporary Written Japanese’
Masayuki Asahara | Yuji Matsumoto
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Paratactic syntactic structures are difficult to represent in syntactic dependency tree structures. As such, we propose an annotation schema for syntactic dependency annotation of Japanese, in which coordinate structures are split from and overlaid on bunsetsu-based (base phrase unit) dependency. The schema represents nested coordinate structures, non-constituent conjuncts, and forward sharing as the set of regions. The annotation was performed on the core data of ‘Balanced Corpus of Contemporary Written Japanese’, which comprised about one million words and 1980 samples from six registers, such as newspapers, books, magazines, and web texts.

We present an attempt to port the international syntactic annotation scheme, Universal Dependencies, to the Japanese language in this paper. Since the Japanese syntactic structure is usually annotated on the basis of unique chunk-based dependencies, we first introduce word-based dependencies by using a word unit called the Short Unit Word, which usually corresponds to an entry in the lexicon UniDic. Porting is done by mapping the part-of-speech tagset in UniDic to the universal part-of-speech tagset, and converting a constituent-based treebank to a typed dependency tree. The conversion is not straightforward, and we discuss the problems that arose in the conversion and the current solutions. A treebank consisting of 10,000 sentences was built by converting the existing resources and is currently released to the public.

Reading-Time Annotations for “Balanced Corpus of Contemporary Written Japanese”
Masayuki Asahara | Hajime Ono | Edson T. Miyamoto
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

The Dundee Eyetracking Corpus contains eyetracking data collected while native speakers of English and French read newspaper editorial articles. Similar resources for other languages are still rare, especially for languages in which words are not overtly delimited with spaces. This is a report on a project to build an eyetracking corpus for Japanese. Measurements were collected while 24 native speakers of Japanese read excerpts from the Balanced Corpus of Contemporary Written Japanese. Texts were presented with or without segmentation (i.e. with or without space at the boundaries between bunsetsu segmentations) and with two types of methodologies (eyetracking and self-paced reading presentation). Readers’ background information including vocabulary-size estimation and Japanese reading-span score were also collected. As an example of the possible uses for the corpus, we also report analyses investigating the phenomena of anti-locality.

The National Institute for Japanese Language and Linguistics, Japan (NINJAL) has undertaken a corpus compilation project to construct a web corpus for linguistic research comprising ten billion words. The project is divided into four parts: page collection, linguistic analysis, development of the corpus concordance system, and preservation. This article presents the corpus concordance system named ‘BonTen’ which enables the ten-billion-scaled corpus to be queried by string, a sequence of morphological information or a subtree of the syntactic dependency structure.

Demonstration of ChaKi.NET – beyond the corpus search system
Masayuki Asahara | Yuji Matsumoto | Toshio Morita
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

ChaKi.NET is a corpus management system for dependency structure annotated corpora. After more than 10 years of continuous development, the system is now usable not only for corpus search, but also for visualization, annotation, labelling, and formatting for statistical analysis. This paper describes the various functions included in the current ChaKi.NET system.

2006

Large scale annotated corpora are very important not only in linguistic research but also in practical natural language processing tasks, since a number of practical tools such as Part-of-speech (POS) taggers and syntactic parsers are now corpus-based or machine learning-based systems which require some amount of accurately annotated corpora. This article presents an annotated corpus management tool that provides various functions that include flexible search, statistic calculation, and error correction for linguistically annotated corpora. The target of annotation covers POS tags, base phrase chunks and syntactic dependency structures. This tool aims at helping the consistent construction of lexicons and annotated corpora to be used by researchers in both the linguistics and language processing communities.

Multi-lingual Dependency Parsing at NAIST
Yuchang Cheng | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

The Construction of a Dictionary for a Two-layer Chinese Morphological Analyzer
Chooi-Ling Goh | Jia Lü | Yuchang Cheng | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation