Daisuke Kawahara

2024

pdf abs
Improving Repository-level Code Search with Text Conversion
Mizuki Kondo | Daisuke Kawahara | Toshiyuki Kurabayashi
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

The ability to generate code using large language models (LLMs) has been increasing year by year. However, studies on code generation at the repository level are not very active. In repository-level code generation, it is necessary to refer to related code snippets among multiple files. By taking the similarity between code snippets, related files are searched and input into an LLM, and generation is performed. This paper proposes a method to search for related files (code search) by taking similarities not between code snippets but between the texts converted from the code snippets by the LLM. We confirmed that converting to text improves the accuracy of code search.

pdf abs
Investigating Web Corpus Filtering Methods for Language Model Development in Japanese
Rintaro Enomoto | Arseny Tolmachev | Takuro Niitsuma | Shuhei Kurita | Daisuke Kawahara
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

The development of large language models (LLMs) is becoming increasingly significant, and there is a demand for high-quality, large-scale corpora for their pretraining.The quality of a web corpus is especially essential to improve the performance of LLMs because it accounts for a large proportion of the whole corpus. However, filtering methods for Web corpora have yet to be established.In this paper, we present empirical studies to reveal which filtering methods are indeed effective and analyze why they are.We build classifiers and language models in Japanese that can process large amounts of corpora rapidly enough for pretraining LLMs in limited computational resources. By evaluating these filtering methods based on a Web corpus quality evaluation benchmark, we reveal that the most accurate method is the N-gram language model. Indeed, we empirically present that strong filtering methods can rather lead to lesser performance in downstream tasks.We also report that the proportion of some specific topics in the processed documents decreases significantly during the filtering process.

pdf abs
Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation
Hao Wang | Tetsuro Morimura | Ukyo Honda | Daisuke Kawahara
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and difficulty in capturing dependency between target words accurately. Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exposure bias. To address these challenges, we apply reinforcement learning (RL) to Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models. We explore two RL approaches: stepwise reward maximization and episodic reward maximization. We discuss the respective pros and cons of these two approaches and empirically verify them. Moreover, we experimentally investigate the impact of temperature setting on performance, confirming the importance of proper temperature setting for NAR models’ training.

pdf abs
A Benchmark Suite of Japanese Natural Questions
Takuya Uematsu | Hao Wang | Daisuke Kawahara | Tomohide Shibata
Proceedings of the 13th Joint Conference on Lexical and Computational Semantics (*SEM 2024)

To develop high-performance and robust natural language processing (NLP) models, it is important to have various question answering (QA) datasets to train, evaluate, and analyze them. Although there are various QA datasets available in English, there are only a few QA datasets in other languages. We focus on Japanese, a language with only a few basic QA datasets, and aim to build a Japanese version of Natural Questions (NQ) consisting of questions that naturally arise from human information needs. We collect natural questions from query logs of a Japanese search engine and build the dataset using crowdsourcing. We construct Japanese Natural Questions (JNQ) and a Japanese version of BoolQ (JBoolQ), which is derived from NQ and consists of yes/no questions. JNQ consists of 16,871 questions, and JBoolQ consists of 6,467 questions. We also define two tasks from JNQ and one from JBoolQ and establish baselines using competitive methods drawn from related literature. We hope that these datasets will facilitate research on QA and NLP models in Japanese. We are planning to release JNQ and JBoolQ.

pdf abs
Time-aware COMET: A Commonsense Knowledge Model with Temporal Knowledge
Eiki Murata | Daisuke Kawahara
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

To better handle commonsense knowledge, which is difficult to acquire in ordinary training of language models, commonsense knowledge graphs and commonsense knowledge models have been constructed. The former manually and symbolically represents commonsense, and the latter stores these graphs’ knowledge in the models’ parameters. However, the existing commonsense knowledge models that deal with events do not consider granularity or time axes. In this paper, we propose a time-aware commonsense knowledge model, TaCOMET. The construction of TaCOMET consists of two steps. First, we create TimeATOMIC using ChatGPT, which is a commonsense knowledge graph with time. Second, TaCOMET is built by continually finetuning an existing commonsense knowledge model on TimeATOMIC. TimeATOMIC and continual finetuning let the model make more time-aware generations with rich commonsense than the existing commonsense models. We also verify the applicability of TaCOMET on a robotic decision-making task. TaCOMET outperformed the existing commonsense knowledge model when proper times are input. Our dataset and models will be made publicly available.

pdf abs
SlideAVSR: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition
Hao Wang | Shuhei Kurita | Shuichiro Shimizu | Daisuke Kawahara
Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR)

Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio. In AVSR, considerable efforts have been directed at datasets for facial features such as lip-readings, while they often fall short in evaluating the image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. SlideAVSR provides a new benchmark where models transcribe speech utterances with texts on the slides on the presentation recordings. As technical terminologies that are frequent in paper explanations are notoriously challenging to transcribe without reference texts, our SlideAVSR dataset spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR.

2023

pdf
Exploring Automatic Evaluation Methods based on a Decoder-based LLM for Text Generation
Tomohito Kasahara | Daisuke Kawahara
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf abs
Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models
Hao Wang | Hirofumi Shimizu | Daisuke Kawahara
Findings of the Association for Computational Linguistics: ACL 2023

Recent studies in natural language processing (NLP) have focused on modern languages and achieved state-of-the-art results in many tasks. Meanwhile, little attention has been paid to ancient texts and related tasks. Classical Chinese first came to Japan approximately 2,000 years ago. It was gradually adapted to a Japanese form called Kanbun-Kundoku (Kanbun) in Japanese reading and translating methods, which has significantly impacted Japanese literature. However, compared to the rich resources of ancient texts in mainland China, Kanbun resources remain scarce in Japan.To solve this problem, we construct the first Classical-Chinese-to-Kanbun dataset in the world. Furthermore, we introduce two tasks, character reordering and machine translation, both of which play a significant role in Kanbun comprehension. We also test the current language models on these tasks and discuss the best evaluation method by comparing the results with human scores. We release our code and dataset on GitHub.

pdf abs
KWJA: A Unified Japanese Analyzer Based on Foundation Models
Nobuhiro Ueda | Kazumasa Omura | Takashi Kodama | Hirokazu Kiyomaru | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

We present KWJA, a high-performance unified Japanese text analyzer based on foundation models.KWJA supports a wide range of tasks, including typo correction, word segmentation, word normalization, morphological analysis, named entity recognition, linguistic feature tagging, dependency parsing, PAS analysis, bridging reference resolution, coreference resolution, and discourse relation analysis, making it the most versatile among existing Japanese text analyzers.KWJA solves these tasks in a multi-task manner but still achieves competitive or better performance compared to existing analyzers specialized for each task.KWJA is publicly available under the MIT license at https://github.com/ku-nlp/kwja.

pdf abs
Theoretical Linguistics Rivals Embeddings in Language Clustering for Multilingual Named Entity Recognition
Sakura Imai | Daisuke Kawahara | Naho Orita | Hiromune Oda
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

While embedding-based methods have been dominant in language clustering for multilingual tasks, clustering based on linguistic features has not yet been explored much, as it remains baselines (Tan et al., 2019; Shaffer, 2021). This study investigates whether and how theoretical linguistics improves language clustering for multilingual named entity recognition (NER). We propose two types of language groupings: one based on morpho-syntactic features in a nominal domain and one based on a head parameter. Our NER experiments show that the proposed methods largely outperform a state-of-the-art embedding-based model, suggesting that theoretical linguistics plays a significant role in multilingual learning tasks.

2022

pdf bib abs
Grounding in social media: An approach to building a chit-chat dialogue model
Ritvik Choudhary | Daisuke Kawahara
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

Building open-domain dialogue systems capable of rich human-like conversational ability is one of the fundamental challenges in language generation. However, even with recent advancements in the field, existing open-domain generative models fail to capture and utilize external knowledge, leading to repetitive or generic responses to unseen utterances. Current work on knowledge-grounded dialogue generation primarily focuses on persona incorporation or searching a fact-based structured knowledge source such as Wikipedia. Our method takes a broader and simpler approach, which aims to improve the raw conversation ability of the system by mimicking the human response behavior through casual interactions found on social media. Utilizing a joint retriever-generator setup, the model queries a large set of filtered comment data from Reddit to act as additional context for the seq2seq generator. Automatic and human evaluations on open-domain dialogue datasets demonstrate the effectiveness of our approach.

pdf abs
Generate, Evaluate, and Select: A Dialogue System with a Response Evaluator for Diversity-Aware Response Generation
Ryoma Sakaeda | Daisuke Kawahara
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

We aim to overcome the lack of diversity in responses of current dialogue systems and to develop a dialogue system that is engaging as a conversational partner. We propose a generator-evaluator model that evaluates multiple responses generated by a response generator and selects the best response by an evaluator. By generating multiple responses, we obtain diverse responses. We conduct human evaluations to compare the output of the proposed system with that of a baseline system. The results of the human evaluations showed that the proposed system’s responses were often judged to be better than the baseline system’s, and indicated the effectiveness of the proposed method.

pdf abs
Building a Personalized Dialogue System with Prompt-Tuning
Tomohito Kasahara | Daisuke Kawahara | Nguyen Tung | Shengzhe Li | Kenta Shinzato | Toshinori Sato
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

Dialogue systems without consistent responses are not attractive. In this study, we build a dialogue system that can respond based on a given character setting (persona) to bring consistency. Considering the trend of the rapidly increasing scale of language models, we propose an approach that uses prompt-tuning, which has low learning costs, on pre-trained large-scale language models. The results of the automatic and manual evaluations in English and Japanese show that it is possible to build a dialogue system with more natural and personalized responses with less computational resources than fine-tuning.

pdf abs
Building a Dialogue Corpus Annotated with Expressed and Experienced Emotions
Tatsuya Ide | Daisuke Kawahara
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

In communication, a human would recognize the emotion of an interlocutor and respond with an appropriate emotion, such as empathy and comfort. Toward developing a dialogue system with such a human-like ability, we propose a method to build a dialogue corpus annotated with two kinds of emotions. We collect dialogues from Twitter and annotate each utterance with the emotion that a speaker put into the utterance (expressed emotion) and the emotion that a listener felt after listening to the utterance (experienced emotion). We built a dialogue corpus in Japanese using this method, and its statistical analysis revealed the differences between expressed and experienced emotions. We conducted experiments on recognition of the two kinds of emotions. The experimental results indicated the difficulty in recognizing experienced emotions and the effectiveness of multi-task learning of the two kinds of emotions. We hope that the constructed corpus will facilitate the study on emotion recognition in a dialogue and emotion-aware dialogue response generation.

pdf abs
JGLUE: Japanese General Language Understanding Evaluation
Kentaro Kurihara | Daisuke Kawahara | Tomohide Shibata
Proceedings of the Thirteenth Language Resources and Evaluation Conference

To develop high-performance natural language understanding (NLU) models, it is necessary to have a benchmark to evaluate and analyze NLU ability from various perspectives. While the English NLU benchmark, GLUE, has been the forerunner, benchmarks are now being released for languages other than English, such as CLUE for Chinese and FLUE for French; but there is no such benchmark for Japanese. We build a Japanese NLU benchmark, JGLUE, from scratch without translation to measure the general NLU ability in Japanese. We hope that JGLUE will facilitate NLU research in Japanese.

2021

pdf abs
Multi-Task Learning of Generation and Classification for Emotion-Aware Dialogue Response Generation
Tatsuya Ide | Daisuke Kawahara
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

For a computer to naturally interact with a human, it needs to be human-like. In this paper, we propose a neural response generation model with multi-task learning of generation and classification, focusing on emotion. Our model based on BART (Lewis et al., 2020), a pre-trained transformer encoder-decoder model, is trained to generate responses and recognize emotions simultaneously. Furthermore, we weight the losses for the tasks to control the update of parameters. Automatic evaluations and crowdsourced manual evaluations show that the proposed model makes generated responses more emotionally aware.

2020

pdf abs
Acquiring Social Knowledge about Personality and Driving-related Behavior
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we introduce our psychological approach to collect human-specific social knowledge from a text corpus, using NLP techniques. It is often not explicitly described but shared among people, which we call social knowledge. We focus on the social knowledge, especially personality and driving. We used the language resources that were developed based on psychological research methods; a Japanese personality dictionary (317 words) and a driving experience corpus (8,080 sentences) annotated with behavior and subjectivity. Using them, we automatically extracted collocations between personality descriptors and driving-related behavior from a driving behavior and subjectivity corpus (1,803,328 sentences after filtering) and obtained unique 5,334 collocations. To evaluate the collocations as social knowledge, we designed four step-by-step crowdsourcing tasks. They resulted in 266 pieces of social knowledge. They include the knowledge that might be difficult to recall by themselves but easy to agree with. We discuss the acquired social knowledge and the contribution to implementations into systems.

pdf abs
Development of a Japanese Personality Dictionary based on Psychological Methods
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference

We propose a new approach to constructing a personality dictionary with psychological evidence. In this study, we collect personality words, using word embeddings, and construct a personality dictionary with weights for Big Five traits. The weights are calculated based on the responses of the large sample (N=1,938, female = 1,004, M=49.8years old:20-78, SD=16.3). All the respondents answered a 20-item personality questionnaire and 537 personality items derived from word embeddings. We present the procedures to examine the qualities of responses with psychological methods and to calculate the weights. These result in a personality dictionary with two sub-dictionaries. We also discuss an application of the acquired resources.

pdf abs
Building a Japanese Typo Dataset from Wikipedia’s Revision History
Yu Tanaka | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

User generated texts contain many typos for which correction is necessary for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented so that we cannot simply apply a spelling checker, and (2) the way people inputting kanji logographs results in typos with drastically different surface forms from correct ones. We address them by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.

pdf abs
A Method for Building a Commonsense Inference Dataset based on Basic Events
Kazumasa Omura | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present a scalable, low-bias, and low-cost method for building a commonsense inference dataset that combines automatic extraction from a corpus and crowdsourcing. Each problem is a multiple-choice question that asks contingency between basic events. We applied the proposed method to a Japanese corpus and acquired 104k problems. While humans can solve the resulting problems with high accuracy (88.9%), the accuracy of a high-performance transfer learning model is reasonably low (76.0%). We also confirmed through dataset analysis that the resulting dataset contains low bias. We released the datatset to facilitate language understanding research.

pdf abs
BERT-based Cohesion Analysis of Japanese Texts
Nobuhiro Ueda | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 28th International Conference on Computational Linguistics

The meaning of natural language text is supported by cohesion among various kinds of entities, including coreference relations, predicate-argument structures, and bridging anaphora relations. However, predicate-argument structures for nominal predicates and bridging anaphora relations have not been studied well, and their analyses have been still very difficult. Recent advances in neural networks, in particular, self training-based language models including BERT (Devlin et al., 2019), have significantly improved many natural language processing tasks, making it possible to dive into the study on analysis of cohesion in the whole text. In this study, we tackle an integrated analysis of cohesion in Japanese texts. Our results significantly outperformed existing studies in each task, especially about 10 to 20 point improvement both for zero anaphora and coreference resolution. Furthermore, we also showed that coreference resolution is different in nature from the other tasks and should be treated specially.

Joint entity and relation extraction aims to extract relation triplets from plain text directly. Prior work leverages Sequence-to-Sequence (Seq2Seq) models for triplet sequence generation. However, Seq2Seq enforces an unnecessary order on the unordered triplets and involves a large decoding length associated with error accumulation. These methods introduce exposure bias, which may cause the models overfit to the frequent label combination, thus limiting the generalization ability. We propose a novel Sequence-to-Unordered-Multi-Tree (Seq2UMTree) model to minimize the effects of exposure bias by limiting the decoding length to three within a triplet and removing the order among triplets. We evaluate our model on two datasets, DuIE and NYT, and systematically study how exposure bias alters the performance of Seq2Seq models. Experiments show that the state-of-the-art Seq2Seq model overfits to both datasets while Seq2UMTree shows significantly better generalization. Our code is available at https://github.com/WindChimeRan/OpenJERE.

The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 condition is very different among the countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news in foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. Our reliable COVID-19 related website dataset collected through crowdsourcing ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic-classifier trained on our article-topic pair dataset helps users find their interested information efficiently by putting articles into different categories.

2019

pdf
Applying Machine Translation to Psychology: Automatic Translation of Personality Adjectives
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

pdf abs
Tree-structured Decoding for Solving Math Word Problems
Qianying Liu | Wenyv Guan | Sujian Li | Daisuke Kawahara
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Automatically solving math word problems is an interesting research topic that needs to bridge natural language descriptions and formal math equations. Previous studies introduced end-to-end neural network methods, but these approaches did not efficiently consider an important characteristic of the equation, i.e., an abstract syntax tree. To address this problem, we propose a tree-structured decoding method that generates the abstract syntax tree of the equation in a top-down manner. In addition, our approach can automatically stop during decoding without a redundant stop token. The experimental results show that our method achieves single model state-of-the-art performance on Math23K, which is the largest dataset on this task.

pdf abs
Machine Comprehension Improves Domain-Specific Japanese Predicate-Argument Structure Analysis
Norio Takahashi | Tomohide Shibata | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

To improve the accuracy of predicate-argument structure (PAS) analysis, large-scale training data and knowledge for PAS analysis are indispensable. We focus on a specific domain, specifically Japanese blogs on driving, and construct two wide-coverage datasets as a form of QA using crowdsourcing: a PAS-QA dataset and a reading comprehension QA (RC-QA) dataset. We train a machine comprehension (MC) model based on these datasets to perform PAS analysis. Our experiments show that a stepwise training method is the most effective, which pre-trains an MC model based on the RC-QA dataset to acquire domain knowledge and then fine-tunes based on the PAS-QA dataset.

pdf abs
Diversity-aware Event Prediction based on a Conditional Variational Autoencoder with Reconstruction
Hirokazu Kiyomaru | Kazumasa Omura | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Typical event sequences are an important class of commonsense knowledge. Formalizing the task as the generation of a next event conditioned on a current event, previous work in event prediction employs sequence-to-sequence (seq2seq) models. However, what can happen after a given event is usually diverse, a fact that can hardly be captured by deterministic models. In this paper, we propose to incorporate a conditional variational autoencoder (CVAE) into seq2seq for its ability to represent diverse next events as a probabilistic distribution. We further extend the CVAE-based seq2seq with a reconstruction mechanism to prevent the model from concentrating on highly typical events. To facilitate fair and systematic evaluation of the diversity-aware models, we also extend existing evaluation datasets by tying each current event to multiple next events. Experiments show that the CVAE-based models drastically outperform deterministic models in terms of precision and that the reconstruction mechanism improves the recall of CVAE-based models without sacrificing precision.

pdf abs
Shrinking Japanese Morphological Analyzers With Neural Networks and Semi-supervised Learning
Arseny Tolmachev | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

For languages without natural word boundaries, like Japanese and Chinese, word segmentation is a prerequisite for downstream analysis. For Japanese, segmentation is often done jointly with part of speech tagging, and this process is usually referred to as morphological analysis. Morphological analyzers are trained on data hand-annotated with segmentation boundaries and part of speech tags. A segmentation dictionary or character n-gram information is also provided as additional inputs to the model. Incorporating this extra information makes models large. Modern neural morphological analyzers can consume gigabytes of memory. We propose a compact alternative to these cumbersome approaches which do not rely on any externally provided n-gram or word representations. The model uses only unigram character embeddings, encodes them using either stacked bi-LSTM or a self-attention network, and independently infers both segmentation and part of speech information. The model is trained in an end-to-end and semi-supervised fashion, on labels produced by a state-of-the-art analyzer. We demonstrate that the proposed technique rivals performance of a previous dictionary-based state-of-the-art approach and can even surpass it when training with the combination of human-annotated and automatically-annotated data. Our model itself is significantly smaller than the dictionary-based one: it uses less than 15 megabytes of space.

2018

pdf abs
Knowledge-Enriched Two-Layered Attention Network for Sentiment Analysis
Abhishek Kumar | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We propose a novel two-layered attention network based on Bidirectional Long Short-Term Memory for sentiment analysis. The novel two-layered attention network takes advantage of the external knowledge bases to improve the sentiment prediction. It uses the Knowledge Graph Embedding generated using the WordNet. We build our model by combining the two-layered attention network with the supervised model based on Support Vector Regression using a Multilayer Perceptron network for sentiment analysis. We evaluate our model on the benchmark dataset of SemEval 2017 Task 5. Experimental results show that the proposed model surpasses the top system of SemEval 2017 Task 5. The model performs significantly better by improving the state-of-the-art system at SemEval 2017 Task 5 by 1.7 and 3.7 points for sub-tracks 1 and 2 respectively.

pdf abs
Juman++: A Morphological Analysis Toolkit for Scriptio Continua
Arseny Tolmachev | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present a three-part toolkit for developing morphological analyzers for languages without natural word boundaries. The first part is a C++11/14 lattice-based morphological analysis library that uses a combination of linear and recurrent neural net language models for analysis. The other parts are a tool for exposing problems in the trained model and a partial annotation tool. Our morphological analyzer of Japanese achieves new SOTA on Jumandic-based corpora while being 250 times faster than the previous one. We also perform a small experiment and quantitive analysis and experience of using development tools. All components of the toolkit is open source and available under a permissive Apache 2 License.

pdf abs
Neural Adversarial Training for Semi-supervised Japanese Predicate-argument Structure Analysis
Shuhei Kurita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Japanese predicate-argument structure (PAS) analysis involves zero anaphora resolution, which is notoriously difficult. To improve the performance of Japanese PAS analysis, it is straightforward to increase the size of corpora annotated with PAS. However, since it is prohibitively expensive, it is promising to take advantage of a large amount of raw corpora. In this paper, we propose a novel Japanese PAS analysis model based on semi-supervised adversarial training with a raw corpus. In our experiments, our model outperforms existing state-of-the-art models for Japanese PAS analysis.

pdf
Annotating a Driving Experience Corpus with Behavior and Subjectivity
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf
Comprehensive Annotation of Various Types of Temporal Information on the Time Axis
Tomohiro Sakaguchi | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
JFCKB: Japanese Feature Change Knowledge Base
Tetsuaki Nakamura | Daisuke Kawahara
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
JDCFC: A Japanese Dialogue Corpus with Feature Changes
Tetsuaki Nakamura | Daisuke Kawahara
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
Improving Crowdsourcing-Based Annotation of Japanese Discourse Relations
Yudai Kishimoto | Shinnosuke Sawada | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf abs
Cross-lingual Knowledge Projection Using Machine Translation and Target-side Knowledge Base Completion
Naoki Otani | Hirokazu Kiyomaru | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 27th International Conference on Computational Linguistics

Considerable effort has been devoted to building commonsense knowledge bases. However, they are not available in many languages because the construction of KBs is expensive. To bridge the gap between languages, this paper addresses the problem of projecting the knowledge in English, a resource-rich language, into other languages, where the main challenge lies in projection ambiguity. This ambiguity is partially solved by machine translation and target-side knowledge base completion, but neither of them is adequately reliable by itself. We show their combination can project English commonsense knowledge into Japanese and Chinese with high precision. Our method also achieves a top-10 accuracy of 90% on the crowdsourced English–Japanese benchmark. Furthermore, we use our method to obtain 18,747 facts of accurate Japanese commonsense within a very short period.

2017

pdf abs
Improving Chinese Semantic Role Labeling using High-quality Surface and Deep Case Frames
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This paper presents a method for applying automatically acquired knowledge to semantic role labeling (SRL). We use a large amount of automatically extracted knowledge to improve the performance of SRL. We present two varieties of knowledge, which we call surface case frames and deep case frames. Although the surface case frames are compiled from syntactic parses and can be used as rich syntactic knowledge, they have limited capability for resolving semantic ambiguity. To compensate the deficiency of the surface case frames, we compile deep case frames from automatic semantic roles. We also consider quality management for both types of knowledge in order to get rid of the noise brought from the automatic analyses. The experimental results show that Chinese SRL can be improved using automatically acquired knowledge and the quality management shows a positive effect on this task.

pdf bib abs
Automatically Acquired Lexical Knowledge Improves Japanese Joint Morphological and Dependency Analysis
Daisuke Kawahara | Yuta Hayashibe | Hajime Morita | Sadao Kurohashi
Proceedings of the 15th International Conference on Parsing Technologies

This paper presents a joint model for morphological and dependency analysis based on automatically acquired lexical knowledge. This model takes advantage of rich lexical knowledge to simultaneously resolve word segmentation, POS, and dependency ambiguities. In our experiments on Japanese, we show the effectiveness of our joint model over conventional pipeline models.

pdf abs
Neural Joint Model for Transition-based Chinese Syntactic Analysis
Shuhei Kurita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present neural network-based joint models for Chinese word segmentation, POS tagging and dependency parsing. Our models are the first neural approaches for fully joint Chinese analysis that is known to prevent the error propagation problem of pipeline models. Although word embeddings play a key role in dependency parsing, they cannot be applied directly to the joint task in the previous work. To address this problem, we propose embeddings of character strings, in addition to words. Experiments show that our models outperform existing systems in Chinese word segmentation and POS tagging, and perform preferable accuracies in dependency parsing. We also explore bi-LSTM models with fewer features.

2016

pdf abs
Consistent Word Segmentation, Part-of-Speech Tagging and Dependency Labelling Annotation for Chinese Language
Mo Shen | Wingmui Li | HyunJeong Choe | Chenhui Chu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we propose a new annotation approach to Chinese word segmentation, part-of-speech (POS) tagging and dependency labelling that aims to overcome the two major issues in traditional morphology-based annotation: Inconsistency and data sparsity. We re-annotate the Penn Chinese Treebank 5.0 (CTB5) and demonstrate the advantages of this approach compared to the original CTB5 annotation through word segmentation, POS tagging and machine translation experiments.

pdf
IRT-based Aggregation Model of Crowdsourced Pairwise Comparison for Evaluating Machine Translations
Naoki Otani | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf
Constructing a Dictionary Describing Feature Changes of Arguments in Event Sentences
Tetsuaki Nakamura | Daisuke Kawahara
Proceedings of the Fourth Workshop on Events

pdf
Design of Word Association Games using Dialog Systems for Acquisition of Word Association Knowledge
Yuichiro Machida | Daisuke Kawahara | Sadao Kurohashi | Manabu Sassano
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

pdf bib abs
Large-Scale Acquisition of Commonsense Knowledge via a Quiz Game on a Dialogue System
Naoki Otani | Daisuke Kawahara | Sadao Kurohashi | Nobuhiro Kaji | Manabu Sassano
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)

Commonsense knowledge is essential for fully understanding language in many situations. We acquire large-scale commonsense knowledge from humans using a game with a purpose (GWAP) developed on a smartphone spoken dialogue system. We transform the manual knowledge acquisition process into an enjoyable quiz game and have collected over 150,000 unique commonsense facts by gathering the data of more than 70,000 players over eight months. In this paper, we present a simple method for maintaining the quality of acquired knowledge and an empirical analysis of the knowledge acquisition process. To the best of our knowledge, this is the first work to collect large-scale knowledge via a GWAP on a widely-used spoken dialogue system.

pdf abs
SCTB: A Chinese Treebank in Scientific Domain
Chenhui Chu | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Treebanks are curial for natural language processing (NLP). In this paper, we present our work for annotating a Chinese treebank in scientific domain (SCTB), to address the problem of the lack of Chinese treebanks in this domain. Chinese analysis and machine translation experiments conducted using this treebank indicate that the annotated treebank can significantly improve the performance on both tasks. This treebank is released to promote Chinese NLP research in scientific domain.

pdf
Neural Network-Based Model for Japanese Predicate Argument Structure Analysis
Tomohide Shibata | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
M2L at SemEval-2016 Task 8: AMR Parsing with Neural Networks
Yevgeniy Puzikov | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf
Leveraging VerbNet to build Corpus-Specific Verb Clusters
Daniel Peterson | Jordan Boyd-Graber | Martha Palmer | Daisuke Kawahara
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2015

pdf
Corpus Patterns for Semantic Processing
Octavian Popescu | Patrick Hanks | Elisabetta Jezek | Daisuke Kawahara
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: Tutorial Abstracts

pdf
Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model
Hajime Morita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
Classification and Acquisition of Contradictory Event Pairs using Crowdsourcing
Yu Takabatake | Hajime Morita | Daisuke Kawahara | Sadao Kurohashi | Ryuichiro Higashinaka | Yoshihiro Matsuo
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

pdf bib
Location Name Disambiguation Exploiting Spatial Proximity and Temporal Consistency
Takashi Awamura | Daisuke Kawahara | Eiji Aramaki | Tomohide Shibata | Sadao Kurohashi
Proceedings of the third International Workshop on Natural Language Processing for Social Media

pdf
Chinese Semantic Role Labeling using High-quality Syntactic Knowledge
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

2014

pdf abs
A Framework for Compiling High Quality Knowledge Resources From Raw Corpora
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The identification of various types of relations is a necessary step to allow computers to understand natural language text. In particular, the clarification of relations between predicates and their arguments is essential because predicate-argument structures convey most of the information in natural languages. To precisely capture these relations, wide-coverage knowledge resources are indispensable. Such knowledge resources can be derived from automatic parses of raw corpora, but unfortunately parsing still has not achieved a high enough performance for precise knowledge acquisition. We present a framework for compiling high quality knowledge resources from raw corpora. Our proposed framework selects high quality dependency relations from automatic parses and makes use of them for not only the calculation of fundamental distributional similarity but also the acquisition of knowledge such as case frames.

pdf abs
Single Classifier Approach for Verb Sense Disambiguation based on Generalized Features
Daisuke Kawahara | Martha Palmer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present a supervised method for verb sense disambiguation based on VerbNet. Most previous supervised approaches to verb sense disambiguation create a classifier for each verb that reaches a frequency threshold. These methods, however, have a significant practical problem that they cannot be applied to rare or unseen verbs. In order to overcome this problem, we create a single classifier to be applied to rare or unseen verbs in a new text. This single classifier also exploits generalized semantic features of a verb and its modifiers in order to better deal with rare or unseen verbs. Our experimental results show that the proposed method achieves equivalent performance to per-verb classifiers, which cannot be applied to unseen verbs. Our classifier could be utilized to improve the classifications in lexical resources of verbs, such as VerbNet, in a semi-automatic manner and to possibly extend the coverage of these resources to new verbs.

pdf abs
Post-editing user interface using visualization of a sentence structure
Yudai Kishimoto | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

Translation has become increasingly important by virtue of globalization. To reduce the cost of translation, it is necessary to use machine translation and further to take advantage of post-editing based on the result of a machine translation for accurate information dissemination. Such post-editing (e.g., PET [Aziz et al., 2012]) can be used practically for translation between European languages, which has a high performance in statistical machine translation. However, due to the low accuracy of machine translation between languages with different word order, such as Japanese-English and Japanese-Chinese, post-editing has not been used actively.

pdf
A Step-wise Usage-based Method for Inducing Polysemy-aware Verb Classes
Daisuke Kawahara | Daniel W. Peterson | Martha Palmer
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Chinese Morphological Analysis with Character-level POS Tagging
Mo Shen | Hongxiao Liu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing
Daisuke Kawahara | Yuichiro Machida | Tomohide Shibata | Sadao Kurohashi | Hayato Kobayashi | Manabu Sassano
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf
Inducing Example-based Semantic Frames from a Massive Amount of Verb Uses
Daisuke Kawahara | Daniel Peterson | Octavian Popescu | Martha Palmer
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

We present a method for acquiring reliable predicate-argument structures from raw corpora for automatic compilation of case frames. Such lexicon compilation requires highly reliable predicate-argument structures to practically contribute to Natural Language Processing (NLP) applications, such as paraphrasing, text entailment, and machine translation. However, to precisely identify predicate-argument structures, case frames are required. This issue is similar to the question ""what came first: the chicken or the egg?"" In this paper, we propose the first step in the extraction of reliable predicate-argument structures without using case frames. We first apply chunking to raw corpora and then extract reliable chunks to ensure that high-quality predicate-argument structures are obtained from the chunks. We conducted experiments to confirm the effectiveness of our approach. We successfully extracted reliable chunks of an accuracy of 98% and high-quality predicate-argument structures of an accuracy of 97%. Our experiments confirmed that we succeeded in acquiring highly reliable predicate-argument structures that can be used to compile case frames.

2009

pdf
Capturing Consistency between Intra-clause and Inter-clause Relations in Knowledge-rich Dependency and Case Structure Analysis
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

pdf
Mining Parallel Texts from Mixed-Language Web Pages
Masao Utiyama | Daisuke Kawahara | Keiji Yasuda | Eiichiro Sumita
Proceedings of Machine Translation Summit XII: Papers

pdf
The Effect of Corpus Size on Case Frame Acquisition for Discourse Analysis
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

pdf
Dependency Parsing with Short Dependency Relations in Unlabeled Data
Wenliang Chen | Daisuke Kawahara | Kiyotaka Uchimoto | Yujie Zhang | Hitoshi Isahara
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf
TSUBAKI: An Open Search Engine Infrastructure for Developing New Information Access Methodology
Keiji Shinzato | Tomohide Shibata | Daisuke Kawahara | Chikara Hashimoto | Sadao Kurohashi
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf
Learning Reliability of Parses for Domain Adaptation of Dependency Parsing
Daisuke Kawahara | Kiyotaka Uchimoto
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf
Coordination Disambiguation without Any Similarities
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf
A Fully-Lexicalized Probabilistic Model for Japanese Zero Anaphora Resolution
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf
Chinese Dependency Parsing with Large Scale Automatically Constructed Case Structures
Kun Yu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf abs
A Method for Automatically Constructing Case Frames for English
Daisuke Kawahara | Kiyotaka Uchimoto
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Case frames are an important knowledge base for a variety of natural language processing (NLP) systems. For the practical use of these systems in the real world, wide-coverage case frames are required. In order to acquire such large-scale case frames, in this paper, we automatically compile case frames from a large corpus. The resultant case frames that are compiled from the English Gigaword corpus contain 9,300 verb entries. The case frames include most examples of normal usage, and are ready to be used in numerous NLP analyzers and applications.

pdf abs
A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure
Keiji Shinzato | Daisuke Kawahara | Chikara Hashimoto | Sadao Kurohashi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In recent years, language resources acquired from theWeb are released, and these data improve the performance of applications in several NLP tasks. Although the language resources based on the web page unit are useful in NLP tasks and applications such as knowledge acquisition, document retrieval and document summarization, such language resources are not released so far. In this paper, we propose a data format for results of web page processing, and a search engine infrastructure which makes it possible to share approximately 100 million Japanese web data. By obtaining the web data, NLP researchers are enabled to begin their own processing immediately without analyzing web pages by themselves.

pdf
Construction of an Idiom Corpus and its Application to Idiom Identification based on WSD Incorporating Idiom-Specific Features
Chikara Hashimoto | Daisuke Kawahara
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

pdf
Minimally Lexicalized Dependency Parsing
Daisuke Kawahara | Kiyotaka Uchimoto
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf
Probabilistic Coordination Disambiguation in a Fully-Lexicalized Japanese Parser
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf
Example-based machine translation based on deeper NLP
Toshiaki Nakazawa | Kun Yu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

pdf abs
Case Frame Compilation from the Web using High-Performance Computing
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Case frames are important knowledge for a variety of NLP systems, especially when wide-coverage case frames are available. To acquire such large-scale case frames, it is necessary to automatically compile them from an enormous amount of corpus. In this paper, we consider the web as a corpus. We first build a huge text corpus from the web, and then construct case frames from the corpus. It is infeasible to do these processes by one CPU, and thus we employ a high-performance computing environment composed of 350 CPUs. The acquired corpus consists of 470M sentences, and the case frames compiled from them have 90,000 verb entries. The case frames contain most examples of usual use, and are ready to be applied to lots of NLP analyses and applications.

pdf
A Fully-Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

2005

pdf
Example-based Machine Translation Pursuing Fully Structural NLP
Sadao Kurohashi | Toshiaki Nakazawa | Kauffmann Alexis | Daisuke Kawahara
Proceedings of the Second International Workshop on Spoken Language Translation

pdf
PP-Attachment Disambiguation Boosted by a Gigantic Volume of Unambiguous Examples
Daisuke Kawahara | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

pdf
Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus
Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

2004

pdf
Converting Text into Agent Animations: Assigning Gestures to Text
Yukiko I. Nakano | Masashi Okamoto | Daisuke Kawahara | Qing Li | Toyoaki Nishida
Proceedings of HLT-NAACL 2004: Short Papers

pdf
Improving Japanese Zero Pronoun Resolution by Global Word Sense Disambiguation
Daisuke Kawahara | Sadao Kurohashi
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf
Automatic Construction of Nominal Case Frames and its Application to Indirect Anaphora Resolution
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf abs
Toward Text Understanding: Integrating Relevance-tagged Corpus and Automatically Constructed Case Frames
Daisuke Kawahara | Ryohei Sasano | Sadao Kurohashi
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper proposes a wide-range anaphora resolution system toward text understanding. This system resolves zero, direct and indirect anaphors in Japanese texts by integrating two sorts of linguistic resources: a hand-annotated corpus with various relations and automatically constructed case frames. The corpus has relevance tags which consist of predicate-argument relations, relations between nouns and coreferences, and is utilized for learning parameters of the system and testing it. The case frames are indispensable knowledge both for detecting zero/indirect anaphors and estimating appropriate antecedents. Our preliminary experiments showed promising results.