Sadao Kurohashi

2021

Extractive Summarization Considering Discourse and Coreference Relations based on Heterogeneous Graph
Yin Jou Huang | Sadao Kurohashi
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Modeling the relations between text spans in a document is a crucial yet challenging problem for extractive summarization. Various kinds of relations exist among text spans of different granularity, such as discourse relations between elementary discourse units and coreference relations between phrase mentions. In this paper, we propose a heterogeneous graph based model for extractive summarization that incorporates both discourse and coreference relations. The heterogeneous graph contains three types of nodes, each corresponding to text spans of a different granularity. Experimental results on a benchmark summarization dataset verify the effectiveness of our proposed method.

Japanese Zero Anaphora Resolution Can Benefit from Parallel Texts Through Neural Transfer Learning
Masato Umakoshi | Yugo Murawaki | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2021

Parallel texts of Japanese and a non-pro-drop language have the potential of improving the performance of Japanese zero anaphora resolution (ZAR) because pronouns dropped in the former are usually mentioned explicitly in the latter. However, rule-based cross-lingual transfer is hampered by error propagation in an NLP pipeline and the frequent lack of transparency in translation correspondences. In this paper, we propose implicit transfer by injecting machine translation (MT) as an intermediate task between pretraining and ZAR. We employ a pretrained BERT model to initialize the encoder part of the encoder-decoder model for MT, and eject the encoder part for fine-tuning on ZAR. The proposed framework empirically demonstrates that ZAR performance can be improved by transfer learning from MT. In addition, we find that the incorporation of the masked language model training into MT leads to further gains.

This paper presents the results of the shared tasks from the 8th workshop on Asian translation (WAT2021). For the WAT2021, 28 teams participated in the shared tasks and 24 teams submitted their translation results for the human evaluation. We also accepted 5 research papers. About 2,100 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

Frustratingly Easy Edit-based Linguistic Steganography with a Masked Language Model
Honai Ueoka | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

With advances in neural language models, the focus of linguistic steganography has shifted from edit-based approaches to generation-based ones. While the latter’s payload capacity is impressive, generating genuine-looking texts remains challenging. In this paper, we revisit edit-based linguistic steganography, with the idea that a masked language model offers an off-the-shelf solution. The proposed method eliminates painstaking rule construction and has a high payload capacity for an edit-based model. It is also shown to be more secure against automatic detection than a generation-based method while offering better control of the security/payload capacity trade-off.
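
The edit-based idea in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the `CANDIDATES` table is a hypothetical stand-in for the top-ranked substitutions a masked language model would propose at each editable position, and the function names are made up for illustration. Sender and receiver are assumed to derive the same candidate lists, which lets the receiver read the payload back from the chosen words.

```python
# Toy edit-based steganography. CANDIDATES is a hypothetical stand-in for the
# top-k predictions a masked language model would give at each editable position.
CANDIDATES = {
    1: ["quick", "fast", "swift", "rapid"],  # 4 choices carry 2 bits
    4: ["jumps", "leaps"],                   # 2 choices carry 1 bit
}


def bits_per_position(candidates):
    """Number of payload bits one position carries (list sizes are powers of two)."""
    return (len(candidates) - 1).bit_length()


def encode(tokens, bits):
    """Replace each editable token with the candidate indexed by the next bits."""
    out, pos = list(tokens), 0
    for i in sorted(CANDIDATES):
        width = bits_per_position(CANDIDATES[i])
        chunk = bits[pos:pos + width].ljust(width, "0")  # pad a short payload with zeros
        out[i] = CANDIDATES[i][int(chunk, 2)]
        pos += width
    return out


def decode(tokens):
    """Recover the bit string from the index of each chosen candidate."""
    chunks = []
    for i in sorted(CANDIDATES):
        width = bits_per_position(CANDIDATES[i])
        chunks.append(format(CANDIDATES[i].index(tokens[i]), f"0{width}b"))
    return "".join(chunks)


cover = "the quick fox often jumps".split()
stego = encode(cover, "101")
print(" ".join(stego), decode(stego))
```

In this sketch, shrinking each candidate list lowers the payload per position but keeps the substitutions closer to the most likely words, which is one simple way to picture the security/payload trade-off the abstract mentions.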

Contextualized and Generalized Sentence Representations by Contrastive Self-Supervised Learning: A Case Study on Discourse Relation Analysis
Hirokazu Kiyomaru | Sadao Kurohashi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a method to learn contextualized and generalized sentence representations using contrastive self-supervised learning. In the proposed method, a model is given a text consisting of multiple sentences. One sentence is randomly selected as a target sentence. The model is trained to maximize the similarity between the representation of the target sentence with its context and that of the masked target sentence with the same context. Simultaneously, the model minimizes the similarity between the latter representation and the representation of a random sentence with the same context. We apply our method to discourse relation analysis in English and Japanese and show that it outperforms strong baseline methods based on BERT, XLNet, and RoBERTa.
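
The training objective described in this abstract can be sketched as a two-way contrastive loss. The abstract does not give the exact loss, so the softmax form, the cosine similarity, and the `temperature` parameter below are assumptions, and the vectors stand in for encoder outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(masked, target, random_sent, temperature=0.1):
    """Low when `masked` is close to `target` and far from `random_sent`.

    masked:      representation of the masked target sentence in context
    target:      representation of the real target sentence in context
    random_sent: representation of a random sentence placed in the same context
    """
    pos = math.exp(cosine(masked, target) / temperature)
    neg = math.exp(cosine(masked, random_sent) / temperature)
    return -math.log(pos / (pos + neg))


# Hypothetical 2-d representations, for illustration only.
masked, target, random_sent = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(contrastive_loss(masked, target, random_sent))
```

Minimizing this quantity raises the similarity to the true target and lowers the similarity to the random sentence in a single step, matching the "maximize" and "minimize" halves of the description.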

Lightweight Cross-Lingual Sentence Representation Learning
Zhuoyuan Mao | Prakhar Gupta | Chenhui Chu | Martin Jaggi | Sadao Kurohashi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Large-scale models for learning fixed-dimensional cross-lingual sentence representations like LASER (Artetxe and Schwenk, 2019b) lead to significant improvement in performance on downstream tasks. However, further increases and modifications based on such large-scale models are usually impractical due to memory limitations. In this work, we introduce a lightweight dual-transformer architecture with just 2 layers for generating memory-efficient cross-lingual sentence representations. We explore different training tasks and observe that current cross-lingual training tasks leave a lot to be desired for this shallow architecture. To ameliorate this, we propose a novel cross-lingual language model, which combines the existing single-word masked language model with the newly proposed cross-lingual token-level reconstruction task. We further augment the training task by the introduction of two computationally-lite sentence-level contrastive learning tasks to enhance the alignment of cross-lingual sentence representation space, which compensates for the learning bottleneck of the lightweight transformer for generative tasks. Our comparisons with competing models on cross-lingual sentence retrieval and multilingual document classification confirm the effectiveness of the newly proposed training tasks for a shallow model.

Video-guided Machine Translation with Spatial Hierarchical Attention Network
Weiqi Gu | Haiyue Song | Chenhui Chu | Sadao Kurohashi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

Video-guided machine translation, as one type of multimodal machine translation, aims to engage video contents as auxiliary information to address the word sense ambiguity problem in machine translation. Previous studies only use features from pretrained action detection models as motion representations of the video to solve the verb sense ambiguity, leaving the noun sense ambiguity a problem. To address this problem, we propose a video-guided machine translation system by using both spatial and motion representations in videos. For spatial features, we propose a hierarchical attention network to model the spatial information from object-level to video-level. Experiments on the VATEX dataset show that our system achieves a BLEU-4 score of 35.86, which is 0.51 points higher than the single model of the SOTA method.

2020

A Method for Building a Commonsense Inference Dataset based on Basic Events
Kazumasa Omura | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present a scalable, low-bias, and low-cost method for building a commonsense inference dataset that combines automatic extraction from a corpus and crowdsourcing. Each problem is a multiple-choice question that asks about contingency between basic events. We applied the proposed method to a Japanese corpus and acquired 104k problems. While humans can solve the resulting problems with high accuracy (88.9%), the accuracy of a high-performance transfer learning model is reasonably low (76.0%). We also confirmed through dataset analysis that the resulting dataset contains low bias. We released the dataset to facilitate language understanding research.

This paper presents the results of the shared tasks from the 7th workshop on Asian translation (WAT2020). For the WAT2020, 20 teams participated in the shared tasks and 14 teams submitted their translation results for the human evaluation. We also received 12 research paper submissions out of which 7 were accepted. About 500 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

Meta Ensemble for Japanese-Chinese Neural Machine Translation: Kyoto-U+ECNU Participation to WAT 2020
Zhuoyuan Mao | Yibin Shen | Chenhui Chu | Sadao Kurohashi | Cheqing Jin
Proceedings of the 7th Workshop on Asian Translation

This paper describes the Japanese-Chinese Neural Machine Translation (NMT) system submitted by the joint team of Kyoto University and East China Normal University (Kyoto-U+ECNU) to WAT 2020 (Nakazawa et al., 2020). We participate in the ASPEC Japanese-Chinese translation task. We revisit several techniques for NMT including various architectures, different data selection and augmentation methods, denoising pre-training, and also some specific tricks for Japanese-Chinese translation. We eventually perform a meta ensemble to combine all of the models into a single model. The BLEU results of this meta-ensembled model rank first in both directions of the ASPEC Japanese-Chinese translation task.

The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education. Meanwhile, the COVID-19 situation differs considerably among countries (e.g., policies and development of the epidemic), and thus citizens would be interested in news from foreign countries. We build a system for worldwide COVID-19 information aggregation containing reliable articles from 10 regions in 7 languages sorted by topics. Our reliable COVID-19 related website dataset collected through crowdsourcing ensures the quality of the articles. A neural machine translation module translates articles in other languages into Japanese and English. A BERT-based topic-classifier trained on our article-topic pair dataset helps users find information of interest efficiently by putting articles into different categories.

Joint entity and relation extraction aims to extract relation triplets from plain text directly. Prior work leverages Sequence-to-Sequence (Seq2Seq) models for triplet sequence generation. However, Seq2Seq enforces an unnecessary order on the unordered triplets and involves a large decoding length associated with error accumulation. These methods introduce exposure bias, which may cause the models to overfit to frequent label combinations, thus limiting the generalization ability. We propose a novel Sequence-to-Unordered-Multi-Tree (Seq2UMTree) model to minimize the effects of exposure bias by limiting the decoding length to three within a triplet and removing the order among triplets. We evaluate our model on two datasets, DuIE and NYT, and systematically study how exposure bias alters the performance of Seq2Seq models. Experiments show that the state-of-the-art Seq2Seq model overfits to both datasets while Seq2UMTree shows significantly better generalization. Our code is available at https://github.com/WindChimeRan/OpenJERE.

Dynamically Updating Event Representations for Temporal Relation Classification with Multi-category Learning
Fei Cheng | Masayuki Asahara | Ichiro Kobayashi | Sadao Kurohashi
Findings of the Association for Computational Linguistics: EMNLP 2020

Temporal relation classification is the pair-wise task of identifying the relation of a temporal link (TLINK) between two mentions, i.e. event, time and document creation time (DCT). This pair-wise formulation leads to two crucial limitations: 1) Two TLINKs involving a common mention do not share information. 2) Existing models with independent classifiers for each TLINK category (E2E, E2T and E2D) cannot make use of the whole data. This paper presents an event-centric model that manages dynamic event representations across multiple TLINKs. Our model deals with three TLINK categories with multi-task learning to leverage the full size of data. The experimental results show that our proposal outperforms state-of-the-art models and two strong transfer learning baselines on both the English and Japanese data.

BERT-based Cohesion Analysis of Japanese Texts
Nobuhiro Ueda | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 28th International Conference on Computational Linguistics

The meaning of natural language text is supported by cohesion among various kinds of entities, including coreference relations, predicate-argument structures, and bridging anaphora relations. However, predicate-argument structures for nominal predicates and bridging anaphora relations have not been studied well, and their analysis remains very difficult. Recent advances in neural networks, in particular, self training-based language models including BERT (Devlin et al., 2019), have significantly improved many natural language processing tasks, making it possible to dive into the analysis of cohesion in the whole text. In this study, we tackle an integrated analysis of cohesion in Japanese texts. Our results significantly outperformed existing studies in each task, with improvements of about 10 to 20 points for both zero anaphora and coreference resolution. Furthermore, we also showed that coreference resolution is different in nature from the other tasks and should be treated specially.

Native-like Expression Identification by Contrasting Native and Proficient Second Language Speakers
Oleksandr Harust | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 28th International Conference on Computational Linguistics

We propose a novel task of native-like expression identification by contrasting texts written by native speakers and those by proficient second language speakers. This task is highly challenging mainly because 1) the combinatorial nature of expressions prevents us from choosing candidate expressions a priori and 2) the distributions of the two types of texts overlap considerably. Our solution to the first problem is to combine a powerful neural network-based classifier of sentence-level nativeness with an explainability method that measures an approximate contribution of a given expression to the classifier’s prediction. To address the second problem, we introduce a special label neutral and reformulate the classification task as complementary-label learning. Our crowdsourcing-based evaluation and in-depth analysis suggest that our method successfully uncovers linguistically interesting usages distinctive of native speech.

Adapting BERT to Implicit Discourse Relation Classification with a Focus on Discourse Connectives
Yudai Kishimoto | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 12th Language Resources and Evaluation Conference

BERT, a neural network-based language model pre-trained on large corpora, is a breakthrough in natural language processing, significantly outperforming previous state-of-the-art models in numerous tasks. However, there have been few reports on its application to implicit discourse relation classification, and it is not clear how BERT is best adapted to the task. In this paper, we test three methods of adaptation. (1) We perform additional pre-training on text tailored to discourse classification. (2) In expectation of knowledge transfer from explicit discourse relations to implicit discourse relations, we add a task named explicit connective prediction at the additional pre-training step. (3) To exploit implicit connectives given by treebank annotators, we add a task named implicit connective prediction at the fine-tuning step. We demonstrate that these three techniques can be combined straightforwardly in a single training pipeline. Through comprehensive experiments, we found that the first and second techniques provide additional gain while the last one did not.

Acquiring Social Knowledge about Personality and Driving-related Behavior
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the 12th Language Resources and Evaluation Conference

In this paper, we introduce our psychological approach to collecting human-specific social knowledge from a text corpus, using NLP techniques. Such knowledge is often not explicitly described but is shared among people; we call it social knowledge. We focus on social knowledge about personality and driving. We used language resources that were developed based on psychological research methods: a Japanese personality dictionary (317 words) and a driving experience corpus (8,080 sentences) annotated with behavior and subjectivity. Using them, we automatically extracted collocations between personality descriptors and driving-related behavior from a driving behavior and subjectivity corpus (1,803,328 sentences after filtering) and obtained 5,334 unique collocations. To evaluate the collocations as social knowledge, we designed four step-by-step crowdsourcing tasks. They resulted in 266 pieces of social knowledge. These include knowledge that people might find difficult to recall on their own but easy to agree with. We discuss the acquired social knowledge and its contribution to implementations in systems.

Development of a Japanese Personality Dictionary based on Psychological Methods
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the 12th Language Resources and Evaluation Conference

We propose a new approach to constructing a personality dictionary with psychological evidence. In this study, we collect personality words, using word embeddings, and construct a personality dictionary with weights for Big Five traits. The weights are calculated based on the responses of a large sample (N = 1,938; female = 1,004; mean age = 49.8 years, range 20-78, SD = 16.3). All the respondents answered a 20-item personality questionnaire and 537 personality items derived from word embeddings. We present the procedures to examine the qualities of responses with psychological methods and to calculate the weights. These result in a personality dictionary with two sub-dictionaries. We also discuss an application of the acquired resources.

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
Haiyue Song | Raj Dabre | Atsushi Fujita | Sadao Kurohashi
Proceedings of the 12th Language Resources and Evaluation Conference

Lectures translation is a case of spoken language translation and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a framework for parallel corpus mining which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in a multistage fine-tuning based domain adaptation for high-quality lectures translation. For Japanese–English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests some guidelines to gather and clean corpora, mine parallel sentences, address noise in the mined data, and create high-quality evaluation splits. For the sake of reproducibility, we have released our code for parallel data creation.
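
The alignment step mentioned in this abstract, cosine similarity over sentence representations, can be sketched as follows. This is an illustrative greedy matcher and not the released code: the vectors are hypothetical stand-ins for embeddings of machine-translated source sentences and target sentences, and the `threshold` value and the one-to-one constraint are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def align(src_vecs, tgt_vecs, threshold=0.8):
    """Greedily pair each source sentence with its most similar unused target.

    Returns (source index, target index, similarity) triples; a source
    sentence is left unaligned when no unused target reaches `threshold`.
    """
    pairs, used = [], set()
    for i, s in enumerate(src_vecs):
        best_j, best_sim = None, threshold
        for j, t in enumerate(tgt_vecs):
            if j in used:
                continue
            sim = cosine(s, t)
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j, best_sim))
    return pairs


# Hypothetical 2-d embeddings, for illustration only.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 2.0], [3.0, 0.1], [-1.0, 0.0]]
for i, j, sim in align(src, tgt):
    print(i, j, round(sim, 3))
```

A threshold of this kind trades recall for precision: raising it discards more candidate pairs but leaves less noise in the mined corpus.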

JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
Zhuoyuan Mao | Fabien Cromieres | Raj Dabre | Haiyue Song | Sadao Kurohashi
Proceedings of the 12th Language Resources and Evaluation Conference

Neural machine translation (NMT) needs large parallel corpora for state-of-the-art translation quality. Low-resource NMT is typically addressed by transfer learning which leverages large monolingual or parallel corpora for pre-training. Monolingual pre-training approaches such as MASS (MAsked Sequence to Sequence) are extremely effective in boosting NMT quality for languages with small parallel corpora. However, they do not account for linguistic information obtained using syntactic analyzers which is known to be invaluable for several Natural Language Processing (NLP) tasks. To this end, we propose JASS, Japanese-specific Sequence to Sequence, as a novel pre-training alternative to MASS for NMT involving Japanese as the source or target language. JASS is joint BMASS (Bunsetsu MASS) and BRSS (Bunsetsu Reordering Sequence to Sequence) pre-training which focuses on Japanese linguistic units called bunsetsus. In our experiments on ASPEC Japanese–English and News Commentary Japanese–Russian translation we show that JASS can give results that are competitive with if not better than those given by MASS. Furthermore, we show for the first time that joint MASS and JASS pre-training gives results that significantly surpass the individual methods indicating their complementary nature. We will release our code, pre-trained models and bunsetsu annotated data as resources for researchers to use in their own NLP tasks.

Towards a Versatile Medical-Annotation Guideline Feasible Without Heavy Medical Knowledge: Starting From Critical Lung Diseases
Shuntaro Yada | Ayami Joh | Ribeka Tanaka | Fei Cheng | Eiji Aramaki | Sadao Kurohashi
Proceedings of the 12th Language Resources and Evaluation Conference

Applying natural language processing (NLP) to medical and clinical texts can bring important social benefits by mining valuable information from unstructured text. A popular application for that purpose is named entity recognition (NER), but the annotation policies of existing clinical corpora have not been standardized across clinical texts of different types. This paper presents an annotation guideline aimed at covering medical documents of various types such as radiography interpretation reports and medical records. Furthermore, the annotation was designed to avoid burdensome requirements related to medical knowledge, thereby enabling corpus development without medical specialists. To achieve these design features, we specifically focus on critical lung diseases to stabilize linguistic patterns in corpora. After annotating around 1100 electronic medical records following the annotation scheme, we demonstrated its feasibility using an NER task. Results suggest that our guideline is applicable to large-scale clinical NLP projects.

Building a Japanese Typo Dataset from Wikipedia’s Revision History
Yu Tanaka | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

User generated texts contain many typos for which correction is necessary for NLP systems to work. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset is available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia’s revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese texts are unsegmented so that we cannot simply apply a spelling checker, and (2) the way people input kanji logographs results in typos with drastically different surface forms from correct ones. We address them by combining character-based extraction rules, morphological analyzers to guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.

Pre-training via Leveraging Assisting Languages for Neural Machine Translation
Haiyue Song | Raj Dabre | Zhuoyuan Mao | Fei Cheng | Sadao Kurohashi | Eiichiro Sumita
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Sequence-to-sequence (S2S) pre-training using large monolingual data is known to improve performance for various S2S NLP tasks. However, large monolingual corpora might not always be available for the languages of interest (LOI). Thus, we propose to exploit monolingual corpora of other languages to complement the scarcity of monolingual corpora for the LOI. We utilize script mapping (Chinese to Japanese) to increase the similarity (number of cognates) between the monolingual corpora of helping languages and LOI. An empirical case study of low-resource Japanese-English neural machine translation (NMT) reveals that leveraging large Chinese and French monolingual corpora can help overcome the shortage of Japanese and English monolingual corpora, respectively, for S2S pre-training. Using only Chinese and French monolingual corpora, we were able to improve Japanese-English translation quality by up to 8.5 BLEU in low-resource scenarios.

2019

Kyoto University Participation to the WMT 2019 News Shared Task
Fabien Cromieres | Sadao Kurohashi
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We describe here the experiments we did for the news translation shared task of WMT 2019. We focused on the new German-to-French language direction, and mostly used current standard approaches to develop a Neural Machine Translation system. We make use of the Tensor2Tensor implementation of the Transformer model. After carefully cleaning the data and noting the importance of the good use of recent monolingual data for the task, we obtain our final result by combining the output of a diverse set of trained models through the use of their “checkpoint agreement”.

Applying Machine Translation to Psychology: Automatic Translation of Personality Adjectives
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

Improving Event Coreference Resolution by Learning Argument Compatibility from Unlabeled Data
Yin Jou Huang | Jing Lu | Sadao Kurohashi | Vincent Ng
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Argument compatibility is a linguistic condition that is frequently incorporated into modern event coreference resolution systems. If two event mentions have incompatible arguments in any of the argument roles, they cannot be coreferent. On the other hand, if these mentions have compatible arguments, then this may be used as information towards deciding their coreferent status. One of the key challenges in leveraging argument compatibility lies in the paucity of labeled data. In this work, we propose a transfer learning framework for event coreference resolution that utilizes a large amount of unlabeled data to learn argument compatibility of event mentions. In addition, we adopt an interactive inference network based model to better capture the compatible and incompatible relations between the context words of event mentions. Our experiments on the KBP 2017 English dataset confirm the effectiveness of our model in learning argument compatibility, which in turn improves the performance of the overall event coreference model.

Shrinking Japanese Morphological Analyzers With Neural Networks and Semi-supervised Learning
Arseny Tolmachev | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

For languages without natural word boundaries, like Japanese and Chinese, word segmentation is a prerequisite for downstream analysis. For Japanese, segmentation is often done jointly with part of speech tagging, and this process is usually referred to as morphological analysis. Morphological analyzers are trained on data hand-annotated with segmentation boundaries and part of speech tags. A segmentation dictionary or character n-gram information is also provided as additional input to the model. Incorporating this extra information makes models large. Modern neural morphological analyzers can consume gigabytes of memory. We propose a compact alternative to these cumbersome approaches that does not rely on any externally provided n-gram or word representations. The model uses only unigram character embeddings, encodes them using either stacked bi-LSTM or a self-attention network, and independently infers both segmentation and part of speech information. The model is trained in an end-to-end and semi-supervised fashion, on labels produced by a state-of-the-art analyzer. We demonstrate that the proposed technique rivals the performance of a previous dictionary-based state-of-the-art approach and can even surpass it when training with the combination of human-annotated and automatically-annotated data. Our model itself is significantly smaller than the dictionary-based one: it uses less than 15 megabytes of space.

Minimally Supervised Learning of Affective Events Using Discourse Relations
Jun Saito | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recognizing affective events that trigger positive or negative sentiment has a wide range of natural language processing applications but remains a challenging problem mainly because the polarity of an event is not necessarily predictable from its constituent words. In this paper, we propose to propagate affective polarity using discourse relations. Our method is simple and only requires a very small seed lexicon and a large raw corpus. Our experiments using Japanese data show that our method learns affective events effectively without manually labeled data. It also improves supervised learning results when labeled data are small.

This paper presents the results of the shared tasks from the 6th workshop on Asian translation (WAT2019) including Ja↔En, Ja↔Zh scientific paper translation subtasks, Ja↔En, Ja↔Ko, Ja↔En patent translation subtasks, Hi↔En, My↔En, Km↔En, Ta↔En mixed domain subtasks and Ru↔Ja news commentary translation task. For the WAT2019, 25 teams participated in the shared tasks. We also received 10 research paper submissions out of which 6 were accepted. About 400 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

Machine Comprehension Improves Domain-Specific Japanese Predicate-Argument Structure Analysis
Norio Takahashi | Tomohide Shibata | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

To improve the accuracy of predicate-argument structure (PAS) analysis, large-scale training data and knowledge for PAS analysis are indispensable. We focus on a specific domain, specifically Japanese blogs on driving, and construct two wide-coverage datasets as a form of QA using crowdsourcing: a PAS-QA dataset and a reading comprehension QA (RC-QA) dataset. We train a machine comprehension (MC) model based on these datasets to perform PAS analysis. Our experiments show that a stepwise training method is the most effective, which pre-trains an MC model based on the RC-QA dataset to acquire domain knowledge and then fine-tunes based on the PAS-QA dataset.

pdf bib abs
Diversity-aware Event Prediction based on a Conditional Variational Autoencoder with Reconstruction
Hirokazu Kiyomaru | Kazumasa Omura | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Typical event sequences are an important class of commonsense knowledge. Formalizing the task as the generation of a next event conditioned on a current event, previous work in event prediction employs sequence-to-sequence (seq2seq) models. However, what can happen after a given event is usually diverse, a fact that can hardly be captured by deterministic models. In this paper, we propose to incorporate a conditional variational autoencoder (CVAE) into seq2seq for its ability to represent diverse next events as a probabilistic distribution. We further extend the CVAE-based seq2seq with a reconstruction mechanism to prevent the model from concentrating on highly typical events. To facilitate fair and systematic evaluation of the diversity-aware models, we also extend existing evaluation datasets by tying each current event to multiple next events. Experiments show that the CVAE-based models drastically outperform deterministic models in terms of precision and that the reconstruction mechanism improves the recall of CVAE-based models without sacrificing precision.

2018

pdf bib abs
Juman++: A Morphological Analysis Toolkit for Scriptio Continua
Arseny Tolmachev | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present a three-part toolkit for developing morphological analyzers for languages without natural word boundaries. The first part is a C++11/14 lattice-based morphological analysis library that uses a combination of linear and recurrent neural net language models for analysis. The other parts are a tool for exposing problems in the trained model and a partial annotation tool. Our morphological analyzer of Japanese achieves new SOTA on Jumandic-based corpora while being 250 times faster than the previous one. We also report a small experiment, a quantitative analysis, and our experience of using the development tools. All components of the toolkit are open source and available under the permissive Apache 2 License.

pdf bib abs
Neural Adversarial Training for Semi-supervised Japanese Predicate-argument Structure Analysis
Shuhei Kurita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Japanese predicate-argument structure (PAS) analysis involves zero anaphora resolution, which is notoriously difficult. To improve the performance of Japanese PAS analysis, it is straightforward to increase the size of corpora annotated with PAS. However, since it is prohibitively expensive, it is promising to take advantage of a large amount of raw corpora. In this paper, we propose a novel Japanese PAS analysis model based on semi-supervised adversarial training with a raw corpus. In our experiments, our model outperforms existing state-of-the-art models for Japanese PAS analysis.

pdf bib abs
Entity-Centric Joint Modeling of Japanese Coreference Resolution and Predicate Argument Structure Analysis
Tomohide Shibata | Sadao Kurohashi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Predicate argument structure analysis is a task of identifying structured events. To improve this field, we need to identify a salient entity, which cannot be identified without performing coreference resolution and predicate argument structure analysis simultaneously. This paper presents an entity-centric joint model for Japanese coreference resolution and predicate argument structure analysis. Each entity is assigned an embedding, and when the result of both analyses refers to an entity, the entity embedding is updated. The analyses take the entity embedding into consideration to access the global information of entities. Our experimental results demonstrate the proposed method can improve the performance of the inter-sentential zero anaphora resolution drastically, which is a notoriously difficult task in predicate argument structure analysis.

pdf bib abs
A Knowledge-Augmented Neural Network Model for Implicit Discourse Relation Classification
Yudai Kishimoto | Yugo Murawaki | Sadao Kurohashi
Proceedings of the 27th International Conference on Computational Linguistics

Identifying discourse relations that are not overtly marked with discourse connectives remains a challenging problem. The absence of explicit clues indicates a need for the combination of world knowledge and weak contextual clues, which can hardly be learned from a small amount of manually annotated data. In this paper, we address this problem by augmenting the input text with external knowledge and context and by adopting a neural network model that can effectively handle the augmented text. Experiments show that external knowledge did improve the classification accuracy. Contextual information provided no significant gain for implicit discourse relations, but it did for explicit ones.

pdf bib abs
Cross-lingual Knowledge Projection Using Machine Translation and Target-side Knowledge Base Completion
Naoki Otani | Hirokazu Kiyomaru | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 27th International Conference on Computational Linguistics

Considerable effort has been devoted to building commonsense knowledge bases. However, they are not available in many languages because the construction of KBs is expensive. To bridge the gap between languages, this paper addresses the problem of projecting the knowledge in English, a resource-rich language, into other languages, where the main challenge lies in projection ambiguity. This ambiguity is partially solved by machine translation and target-side knowledge base completion, but neither of them is adequately reliable by itself. We show their combination can project English commonsense knowledge into Japanese and Chinese with high precision. Our method also achieves a top-10 accuracy of 90% on the crowdsourced English–Japanese benchmark. Furthermore, we use our method to obtain 18,747 facts of accurate Japanese commonsense within a very short period.

pdf bib abs
Knowledge-Enriched Two-Layered Attention Network for Sentiment Analysis
Abhishek Kumar | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We propose a novel two-layered attention network based on Bidirectional Long Short-Term Memory for sentiment analysis. The novel two-layered attention network takes advantage of the external knowledge bases to improve the sentiment prediction. It uses the Knowledge Graph Embedding generated using the WordNet. We build our model by combining the two-layered attention network with the supervised model based on Support Vector Regression using a Multilayer Perceptron network for sentiment analysis. We evaluate our model on the benchmark dataset of SemEval 2017 Task 5. Experimental results show that the proposed model surpasses the top system of SemEval 2017 Task 5. The model performs significantly better by improving the state-of-the-art system at SemEval 2017 Task 5 by 1.7 and 3.7 points for sub-tracks 1 and 2 respectively.

pdf bib
Annotating a Driving Experience Corpus with Behavior and Subjectivity
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

pdf bib
Comprehensive Annotation of Various Types of Temporal Information on the Time Axis
Tomohiro Sakaguchi | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Improving Crowdsourcing-Based Annotation of Japanese Discourse Relations
Yudai Kishimoto | Shinnosuke Sawada | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
Proceedings of the IJCNLP 2017, Tutorial Abstracts
Sadao Kurohashi | Michael Strube
Proceedings of the IJCNLP 2017, Tutorial Abstracts

pdf bib abs
Neural Joint Model for Transition-based Chinese Syntactic Analysis
Shuhei Kurita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present neural network-based joint models for Chinese word segmentation, POS tagging and dependency parsing. Our models are the first neural approaches for fully joint Chinese analysis that is known to prevent the error propagation problem of pipeline models. Although word embeddings play a key role in dependency parsing, they cannot be applied directly to the joint task in the previous work. To address this problem, we propose embeddings of character strings, in addition to words. Experiments show that our models outperform existing systems in Chinese word segmentation and POS tagging, and achieve favorable accuracy in dependency parsing. We also explore bi-LSTM models with fewer features.

pdf bib abs
An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation
Chenhui Chu | Raj Dabre | Sadao Kurohashi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this paper, we propose a novel domain adaptation method named “mixed fine tuning” for neural machine translation (NMT). We combine two existing approaches namely fine tuning and multi domain NMT. We first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus which is a mix of the in-domain and out-of-domain corpora. All corpora are augmented with artificial tags to indicate specific domains. We empirically compare our proposed method against fine tuning and multi domain methods and discuss its benefits and shortcomings.
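The data preparation for the fine-tuning stage described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tag strings `<2in>` and `<2out>`, the choice to prepend the tag to the source sentence, and the oversampling of the in-domain data are all assumptions made for the example.

```python
import random


def tag_corpus(pairs, domain_tag):
    """Prepend an artificial domain tag to every source sentence (assumed format)."""
    return [(f"{domain_tag} {src}", tgt) for src, tgt in pairs]


def build_mixed_corpus(in_domain, out_of_domain,
                       in_tag="<2in>", out_tag="<2out>", seed=0):
    """Return a shuffled mix of tagged in-domain and out-of-domain sentence pairs.

    The in-domain pairs are repeated so that they are roughly as numerous as
    the out-of-domain pairs; this balancing step is an assumption of the sketch.
    """
    tagged_in = tag_corpus(in_domain, in_tag)
    tagged_out = tag_corpus(out_of_domain, out_tag)
    repeat = max(1, len(tagged_out) // max(1, len(tagged_in)))
    mixed = tagged_in * repeat + tagged_out
    random.Random(seed).shuffle(mixed)
    return mixed
```

A model first trained on the (tagged) out-of-domain corpus would then be fine-tuned on the output of `build_mixed_corpus`, so that it keeps seeing out-of-domain data while adapting to the in-domain data.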

pdf bib abs
Kyoto University MT System Description for IWSLT 2017
Raj Dabre | Fabien Cromieres | Sadao Kurohashi
Proceedings of the 14th International Conference on Spoken Language Translation

We describe here our Machine Translation (MT) model and the results we obtained for the IWSLT 2017 Multilingual Shared Task. Motivated by Zero Shot NMT [1], we trained a Multilingual Neural Machine Translation model by combining all the training data into a single collection and appending tokens to the source sentences to indicate the target language they should be translated into. We observed that even in a low resource situation we were able to get translations whose quality surpasses that of Phrase Based Statistical Machine Translation by several BLEU points. The most surprising result was in the zero shot setting for Dutch-German and Italian-Romanian: despite using no parallel corpora between these language pairs, the NMT model was able to translate between these languages, and the translations were as good as or better (in terms of BLEU) than those in the non zero resource setting. We also verify that NMT models that use feed forward layers and self attention instead of recurrent layers are extremely fast to train, which is useful in an NMT experimental setting.
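The way the training data is combined can be illustrated with a short sketch. The token format `<2xx>` and the helper names are hypothetical; only the idea of adding a target-language token to each source sentence and pooling all language pairs comes from the abstract.

```python
def add_target_token(src, tgt_lang):
    """Append a token naming the target language (token format is an assumption)."""
    return f"{src} <2{tgt_lang}>"


def build_multilingual_corpus(corpora):
    """Pool several parallel corpora into one collection.

    corpora maps (src_lang, tgt_lang) to a list of (source, target) pairs.
    """
    combined = []
    for (_src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            combined.append((add_target_token(src, tgt_lang), tgt))
    return combined
```

At test time, a zero shot direction such as Dutch to German would be requested simply by attaching the German token to a Dutch source sentence, even though no Dutch-German pairs were in the pooled data.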

pdf bib abs
Improving Shared Argument Identification in Japanese Event Knowledge Acquisition
Yin Jou Huang | Sadao Kurohashi
Proceedings of the Events and Stories in the News Workshop

Event knowledge represents the knowledge of causal and temporal relations between events. Shared arguments of event knowledge encode patterns of role shifting in successive events. A two-stage framework was proposed for the task of Japanese event knowledge acquisition, in which related event pairs are first extracted, and shared arguments are then identified to form the complete event knowledge. This paper focuses on the second stage of this framework, and proposes a method to improve the shared argument identification of related event pairs. We constructed a gold dataset for shared argument learning. By evaluating our system on this gold dataset, we found that our proposed model outperformed the baseline models by a large margin.

pdf bib abs
Automatic Extraction of High-Quality Example Sentences for Word Learning Using a Determinantal Point Process
Arseny Tolmachev | Sadao Kurohashi
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Flashcard systems are effective tools for learning words but have their limitations in teaching word usage. To overcome this problem, we propose a novel flashcard system that shows a new example sentence on each repetition. This extension requires high-quality example sentences, automatically extracted from a huge corpus. To do this, we use a Determinantal Point Process, which scales well to large data and allows sentence similarity and quality to be represented naturally as features. Our human evaluation experiment on Japanese indicates that the proposed method successfully extracted high-quality example sentences.

This paper presents the results of the shared tasks from the 4th workshop on Asian translation (WAT2017) including J↔E, J↔C scientific paper translation subtasks, C↔J, K↔J, E↔J patent translation subtasks, H↔E mixed domain subtasks, J↔E newswire subtasks and J↔E recipe subtasks. For the WAT2017, 12 institutions participated in the shared tasks. About 300 translation results have been submitted to the automatic evaluation server, and selected submissions were manually evaluated.

pdf bib abs
Kyoto University Participation to WAT 2017
Fabien Cromieres | Raj Dabre | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

We describe here our approaches and results on the WAT 2017 shared translation tasks. Following our good results with Neural Machine Translation in the previous shared task, we continue this approach this year, with incremental improvements in models and training methods. We focused on the ASPEC dataset and could improve the state-of-the-art results for Chinese-to-Japanese and Japanese-to-Chinese translations.

pdf bib abs
Automatically Acquired Lexical Knowledge Improves Japanese Joint Morphological and Dependency Analysis
Daisuke Kawahara | Yuta Hayashibe | Hajime Morita | Sadao Kurohashi
Proceedings of the 15th International Conference on Parsing Technologies

This paper presents a joint model for morphological and dependency analysis based on automatically acquired lexical knowledge. This model takes advantage of rich lexical knowledge to simultaneously resolve word segmentation, POS, and dependency ambiguities. In our experiments on Japanese, we show the effectiveness of our joint model over conventional pipeline models.

pdf bib abs
Improving Chinese Semantic Role Labeling using High-quality Surface and Deep Case Frames
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This paper presents a method for applying automatically acquired knowledge to semantic role labeling (SRL). We use a large amount of automatically extracted knowledge to improve the performance of SRL. We present two varieties of knowledge, which we call surface case frames and deep case frames. Although the surface case frames are compiled from syntactic parses and can be used as rich syntactic knowledge, they have limited capability for resolving semantic ambiguity. To compensate for the deficiency of the surface case frames, we compile deep case frames from automatic semantic roles. We also consider quality management for both types of knowledge in order to get rid of the noise brought from the automatic analyses. The experimental results show that Chinese SRL can be improved using automatically acquired knowledge and that the quality management has a positive effect on this task.

2016

pdf bib
Flexible Non-Terminals for Dependency Tree-to-Tree Reordering
John Richardson | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Design of Word Association Games using Dialog Systems for Acquisition of Word Association Knowledge
Yuichiro Machida | Daisuke Kawahara | Sadao Kurohashi | Manabu Sassano
Proceedings of the 5th Workshop on Automated Knowledge Base Construction

pdf bib
Cross-language Projection of Dependency Trees with Constrained Partial Parsing for Tree-to-Tree Machine Translation
Yu Shen | Chenhui Chu | Fabien Cromieres | Sadao Kurohashi
Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers

pdf bib
The Kyoto University Cross-Lingual Pronoun Translation System
Raj Dabre | Yevgeniy Puzikov | Fabien Cromieres | Sadao Kurohashi
Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers

pdf bib abs
Large-Scale Acquisition of Commonsense Knowledge via a Quiz Game on a Dialogue System
Naoki Otani | Daisuke Kawahara | Sadao Kurohashi | Nobuhiro Kaji | Manabu Sassano
Proceedings of the Open Knowledge Base and Question Answering Workshop (OKBQA 2016)

Commonsense knowledge is essential for fully understanding language in many situations. We acquire large-scale commonsense knowledge from humans using a game with a purpose (GWAP) developed on a smartphone spoken dialogue system. We transform the manual knowledge acquisition process into an enjoyable quiz game and have collected over 150,000 unique commonsense facts by gathering the data of more than 70,000 players over eight months. In this paper, we present a simple method for maintaining the quality of acquired knowledge and an empirical analysis of the knowledge acquisition process. To the best of our knowledge, this is the first work to collect large-scale knowledge via a GWAP on a widely-used spoken dialogue system.

This paper presents the results of the shared tasks from the 3rd workshop on Asian translation (WAT2016) including J ↔ E, J ↔ C scientific paper translation subtasks, C ↔ J, K ↔ J, E ↔ J patent translation subtasks, I ↔ E newswire subtasks and H ↔ E, H ↔ J mixed domain subtasks. For the WAT2016, 15 institutions participated in the shared tasks. About 500 translation results have been submitted to the automatic evaluation server, and selected submissions were manually evaluated.

pdf bib abs
Kyoto University Participation to WAT 2016
Fabien Cromieres | Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

We describe here our approaches and results on the WAT 2016 shared translation tasks. We tried to use both an example-based machine translation (MT) system and a neural MT system. We report very good translation results, especially when using neural MT for Chinese-to-Japanese translation.

pdf bib abs
SCTB: A Chinese Treebank in Scientific Domain
Chenhui Chu | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Treebanks are crucial for natural language processing (NLP). In this paper, we present our work on annotating a Chinese treebank in the scientific domain (SCTB), to address the lack of Chinese treebanks in this domain. Chinese analysis and machine translation experiments conducted using this treebank indicate that the annotated treebank can significantly improve the performance on both tasks. This treebank is released to promote Chinese NLP research in the scientific domain.

pdf bib abs
Consistent Word Segmentation, Part-of-Speech Tagging and Dependency Labelling Annotation for Chinese Language
Mo Shen | Wingmui Li | HyunJeong Choe | Chenhui Chu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

In this paper, we propose a new annotation approach to Chinese word segmentation, part-of-speech (POS) tagging and dependency labelling that aims to overcome the two major issues in traditional morphology-based annotation: inconsistency and data sparsity. We re-annotate the Penn Chinese Treebank 5.0 (CTB5) and demonstrate the advantages of this approach compared to the original CTB5 annotation through word segmentation, POS tagging and machine translation experiments.

pdf bib
IRT-based Aggregation Model of Crowdsourced Pairwise Comparison for Evaluating Machine Translations
Naoki Otani | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Insertion Position Selection Model for Flexible Non-Terminals in Dependency Tree-to-Tree Machine Translation
Toshiaki Nakazawa | John Richardson | Sadao Kurohashi
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Neural Network-Based Model for Japanese Predicate Argument Structure Analysis
Tomohide Shibata | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Dependency Forest based Word Alignment
Hitoshi Otsuki | Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the ACL 2016 Student Research Workshop

pdf bib abs
Paraphrasing Out-of-Vocabulary Words with Word Embeddings and Semantic Lexicons for Low Resource Statistical Machine Translation
Chenhui Chu | Sadao Kurohashi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Out-of-vocabulary (OOV) words are a crucial problem in statistical machine translation (SMT) with low resources. OOV paraphrasing, which augments the translation model for OOV words by using the translation knowledge of their paraphrases, has been proposed to address the OOV problem. In this paper, we propose using word embeddings and semantic lexicons for OOV paraphrasing. Experiments conducted on a low resource setting of the OLYMPICS task of IWSLT 2012 verify the effectiveness of our proposed method.

pdf bib abs
Simultaneous Sentence Boundary Detection and Alignment with Pivot-based Machine Translation Generated Lexicons
Antoine Bourlon | Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Sentence alignment is a task that consists in aligning the parallel sentences in a translated article pair. This paper describes a method to perform sentence boundary detection and alignment simultaneously, which significantly improves the alignment accuracy on languages like Chinese with uncertain sentence boundaries. It relies on the definition of hard (certain) and soft (uncertain) punctuation delimiters, the latter being possibly ignored to optimize the alignment result. The alignment method is used in combination with lexicons automatically generated from the input article pairs using pivot-based MT, achieving better coverage of the input words with fewer entries than pre-existing dictionaries. Pivot-based MT makes it possible to build dictionaries for language pairs that have scarce parallel data. The alignment method is implemented in a tool that will be freely available in the near future.

In this paper, we describe the details of the ASPEC (Asian Scientific Paper Excerpt Corpus), which is the first large-size parallel corpus of scientific paper domain. ASPEC was constructed in the Japanese-Chinese machine translation project conducted between 2006 and 2010 using the Special Coordination Funds for Promoting Science and Technology. It consists of a Japanese-English scientific paper abstract corpus of approximately 3 million parallel sentences (ASPEC-JE) and a Chinese-Japanese scientific paper excerpt corpus of approximately 0.68 million parallel sentences (ASPEC-JC). ASPEC is used as the official dataset for the machine translation evaluation workshop WAT (Workshop on Asian Translation).

pdf bib abs
Parallel Sentence Extraction from Comparable Corpora with Neural Network Features
Chenhui Chu | Raj Dabre | Sadao Kurohashi
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Parallel corpora are crucial for machine translation (MT); however, they are quite scarce for most language pairs and domains. As comparable corpora are far more available, many studies have been conducted to extract parallel sentences from them for MT. In this paper, we exploit the neural network features acquired from neural MT for parallel sentence extraction. We observe significant improvements for both accuracy in sentence extraction and MT performance.

pdf bib
M2L at SemEval-2016 Task 8: AMR Parsing with Neural Networks
Yevgeniy Puzikov | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Classification and Acquisition of Contradictory Event Pairs using Crowdsourcing
Yu Takabatake | Hajime Morita | Daisuke Kawahara | Sadao Kurohashi | Ryuichiro Higashinaka | Yoshihiro Matsuo
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation

pdf bib
Location Name Disambiguation Exploiting Spatial Proximity and Temporal Consistency
Takashi Awamura | Daisuke Kawahara | Eiji Aramaki | Tomohide Shibata | Sadao Kurohashi
Proceedings of the third International Workshop on Natural Language Processing for Social Media

pdf bib
Chinese Semantic Role Labeling using High-quality Syntactic Knowledge
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

pdf bib
Large-scale Dictionary Construction via Pivot-based Statistical Machine Translation with Significance Pruning and Neural Network Features
Raj Dabre | Chenhui Chu | Fabien Cromieres | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf bib
Pivot-Based Topic Models for Low-Resource Lexicon Extraction
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf bib
Cross-language Projection of Dependency Trees for Tree-to-tree Machine Translation
Yu Shen | Chenhui Chu | Fabien Cromieres | Sadao Kurohashi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters

pdf bib
Korean-Chinese word translation using Chinese character knowledge
Yuanmei Lu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of Machine Translation Summit XV: Papers

pdf bib
Enhancing function word translation with syntax-based statistical post-editing
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 6th Workshop on Patent and Scientific Literature Translation

pdf bib
Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model
Hajime Morita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Leveraging Small Multilingual Corpora for SMT Using Many Pivot Languages
Raj Dabre | Fabien Cromieres | Sadao Kurohashi | Pushpak Bhattacharyya
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf bib abs
Post-editing user interface using visualization of a sentence structure
Yudai Kishimoto | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas

Translation has become increasingly important by virtue of globalization. To reduce the cost of translation, it is necessary to use machine translation and further to take advantage of post-editing based on the result of a machine translation for accurate information dissemination. Such post-editing (e.g., PET [Aziz et al., 2012]) can be used practically for translation between European languages, which has a high performance in statistical machine translation. However, due to the low accuracy of machine translation between languages with different word order, such as Japanese-English and Japanese-Chinese, post-editing has not been used actively.

pdf bib abs
Bilingual Dictionary Construction with Transliteration Filtering
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we present a bilingual transliteration lexicon of 170K Japanese-English technical terms in the scientific domain. Translation pairs are extracted by filtering a large list of transliteration candidates generated automatically from a phrase table trained on parallel corpora. Filtering uses a novel transliteration similarity measure based on a discriminative phrase-based machine translation approach. We demonstrate that the extracted dictionary is accurate and of high recall (F1 score 0.8). Our lexicon contains not only single words but also multi-word expressions, and is freely available. Our experiments focus on Katakana-English lexicon construction; however, it would be possible to apply the proposed methods to transliteration extraction for a variety of language pairs.

pdf bib abs
A Large Scale Database of Strongly-related Events in Japanese
Tomohide Shibata | Shotaro Kohama | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Knowledge about the relations between events is quite useful for coreference resolution, anaphora resolution, and several NLP applications such as dialogue systems. This paper presents a large scale database of strongly-related events in Japanese, which has been acquired with our proposed method (Shibata and Kurohashi, 2011). In languages such as Japanese, where omitted arguments or zero anaphora are common, coreference-based event extraction methods are hard to apply, and so our method extracts strongly-related events in a two-phrase construct. This method first calculates the co-occurrence measure between predicate-arguments (events), and regards an event pair whose mutual information is high as strongly-related events. To calculate the co-occurrence measure efficiently, we adopt an association rule mining method. Then, we identify the remaining arguments by using case frames. The database contains approximately 100,000 unique events, with approximately 340,000 strongly-related event pairs, which is much larger than an existing automatically-constructed event database. We evaluated 100 randomly chosen event pairs, and the accuracy was approximately 68%.
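The co-occurrence step can be illustrated with a small sketch that scores event pairs by pointwise mutual information. It is an assumption-laden illustration rather than the method of the paper: the abstract does not say which variant of mutual information is used, and the sketch counts pairs by brute force instead of the association rule mining the abstract mentions. The function name and the input format (one list of event strings per text unit) are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents, min_count=1):
    """Score co-occurring event pairs by pointwise mutual information (base 2).

    documents is a list of lists of event strings; each inner list holds the
    events observed together in one text unit.
    """
    n = len(documents)
    event_counts = Counter()
    pair_counts = Counter()
    for events in documents:
        unique = sorted(set(events))  # count each event once per unit
        event_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    scores = {}
    for (e1, e2), count in pair_counts.items():
        if count < min_count:
            continue  # skip rare pairs, whose estimates are unreliable
        p_pair = count / n
        p1 = event_counts[e1] / n
        p2 = event_counts[e2] / n
        scores[(e1, e2)] = math.log2(p_pair / (p1 * p2))
    return scores
```

Pairs whose score exceeds a chosen threshold would be kept as candidate strongly-related events; the threshold and the `min_count` cutoff are tuning choices not specified in the abstract.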

pdf bib abs
Constructing a Chinese—Japanese Parallel Corpus from Wikipedia
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Parallel corpora are crucial for statistical machine translation (SMT). However, they are quite scarce for most language pairs, such as Chinese―Japanese. As comparable corpora are far more available, many studies have been conducted to automatically construct parallel corpora from comparable corpora. This paper presents a robust parallel sentence extraction system for constructing a Chinese―Japanese parallel corpus from Wikipedia. The system is inspired by previous studies that mainly consist of a parallel sentence candidate filter and a binary classifier for parallel sentence identification. We improve the system by using the common Chinese characters for filtering and two novel feature sets for classification. Experiments show that our system performs significantly better than the previous studies for both accuracy in parallel sentence extraction and SMT performance. Using the system, we construct a Chinese―Japanese parallel corpus with more than 126k highly accurate parallel sentences from Wikipedia. The constructed parallel corpus is freely available at http://orchid.kuee.kyoto-u.ac.jp/~chu/resource/wiki_zh_ja.tgz.

pdf bib abs
Constructing a Corpus of Japanese Predicate Phrases for Synonym/Antonym Relations
Tomoko Izumi | Tomohide Shibata | Hisako Asano | Yoshihiro Matsuo | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We construct a large corpus of Japanese predicate phrases for synonym-antonym relations. The corpus consists of 7,278 pairs of predicates such as receive-permission (ACC) vs. obtain-permission (ACC), in which each predicate pair is accompanied by a noun phrase and case information. The relations are categorized as synonyms, entailment, antonyms, or unrelated. Antonyms are further categorized into three different classes depending on their aspect of oppositeness. Using the data as a training corpus, we conduct the supervised binary classification of synonymous predicates based on linguistically-motivated features. Combining features that are characteristic of synonymous predicates with those that are characteristic of antonymous predicates, we succeed in automatically identifying synonymous predicates at the high F-score of 0.92, a 0.4 improvement over the baseline method of using the Japanese WordNet. The results of an experiment confirm that the quality of the corpus is high enough to achieve automatic classification. To the best of our knowledge, this is the first and the largest publicly available corpus of Japanese predicate phrases for synonym-antonym relations.

pdf bib abs
A Framework for Compiling High Quality Knowledge Resources From Raw Corpora
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The identification of various types of relations is a necessary step to allow computers to understand natural language text. In particular, the clarification of relations between predicates and their arguments is essential because predicate-argument structures convey most of the information in natural languages. To precisely capture these relations, wide-coverage knowledge resources are indispensable. Such knowledge resources can be derived from automatic parses of raw corpora, but unfortunately parsing still has not achieved a high enough performance for precise knowledge acquisition. We present a framework for compiling high quality knowledge resources from raw corpora. Our proposed framework selects high quality dependency relations from automatic parses and makes use of them for not only the calculation of fundamental distributional similarity but also the acquisition of knowledge such as case frames.

pdf bib
Proceedings of the 1st Workshop on Asian Translation (WAT2014)
Toshiaki Nakazawa | Hideya Mino | Isao Goto | Sadao Kurohashi | Eiichiro Sumita
Proceedings of the 1st Workshop on Asian Translation (WAT2014)

pdf bib
Overview of the 1st Workshop on Asian Translation
Toshiaki Nakazawa | Hideya Mino | Isao Goto | Sadao Kurohashi | Eiichiro Sumita
Proceedings of the 1st Workshop on Asian Translation (WAT2014)

pdf bib
KyotoEBMT System Description for the 1st Workshop on Asian Translation
John Richardson | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 1st Workshop on Asian Translation (WAT2014)

pdf bib
Chinese Morphological Analysis with Character-level POS Tagging
Mo Shen | Hongxiao Liu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
KyotoEBMT: An Example-Based Dependency-to-Dependency Translation Framework
John Richardson | Fabien Cromières | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Improving Statistical Machine Translation Accuracy Using Bilingual Lexicon Extraction with Paraphrases
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

pdf bib
Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing
Daisuke Kawahara | Yuichiro Machida | Tomohide Shibata | Sadao Kurohashi | Hayato Kobayashi | Manabu Sassano
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Translation Rules with Right-Hand Side Lattices
Fabien Cromières | Sadao Kurohashi
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

2013

pdf bib
Japanese Zero Reference Resolution Considering Exophora and Author/Reader Mentions
Masatsugu Hangyo | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Automatic Knowledge Acquisition for Case Alternation between the Passive and Active Voices in Japanese
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi | Manabu Okumura
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Chinese–Japanese Parallel Sentence Extraction from Quasi–Comparable Corpora
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Sixth Workshop on Building and Using Comparable Corpora

pdf bib
Towards Fully Lexicalized Dependency Parsing for Korean
Jungyeul Park | Daisuke Kawahara | Sadao Kurohashi | Key-Sun Choi
Proceedings of the 13th International Conference on Parsing Technologies (IWPT 2013)

pdf bib
Precise Information Retrieval Exploiting Predicate-Argument Structures
Daisuke Kawahara | Keiji Shinzato | Tomohide Shibata | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
A Simple Approach to Unknown Word Processing in Japanese Morphological Analysis
Ryohei Sasano | Sadao Kurohashi | Manabu Okumura
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Chinese Word Segmentation by Mining Maximized Substrings
Mo Shen | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Robust Transliteration Mining from Comparable Corpora with Bilingual Topic Models
John Richardson | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
High Quality Dependency Selection from Automatic Parses
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Accurate Parallel Fragment Extraction from Quasi–Comparable Corpora using Alignment Model and Translation Lexicon
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Distortion Model Considering Rich Context for Statistical Machine Translation
Isao Goto | Masao Utiyama | Eiichiro Sumita | Akihiro Tamura | Sadao Kurohashi
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

pdf bib
A Reranking Approach for Dependency Parsing with Variable-sized Subtree Features
Mo Shen | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

pdf bib
Building a Diverse Document Leads Corpus Annotated with Semantic Relations
Masatsugu Hangyo | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

pdf bib
Constrained Hidden Markov Model for Bilingual Keyword Pairs Alignment
Denny Cahyadi | Fabien Cromieres | Sadao Kurohashi
Proceedings of the 10th Workshop on Asian Language Resources

pdf bib
Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation
Chenhui Chu | Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 16th Annual conference of the European Association for Machine Translation

pdf bib
Flexible Japanese Sentence Compression by Relaxing Unit Constraints
Jun Harashima | Sadao Kurohashi
Proceedings of COLING 2012

pdf bib
Semi-Supervised Noun Compound Analysis with Edge and Span Features
Yugo Murawaki | Sadao Kurohashi
Proceedings of COLING 2012

pdf bib
Alignment by Bilingual Generation and Monolingual Derivation
Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of COLING 2012

pdf bib abs
EBMT system of Kyoto University in OLYMPICS task at IWSLT 2012
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign

This paper describes the EBMT system of Kyoto University that participated in the OLYMPICS task at IWSLT 2012. When translating between very different languages such as Chinese and English, it is very important to handle sentences as tree structures to overcome the difference. Many recent studies incorporate tree structures in some parts of the translation process, but not all the way from model training (alignment) to decoding. Our system is a fully tree-based translation system where we use the Bayesian phrase alignment model on dependency trees and example-based translation. To improve the translation quality, we conduct some special processing for the IWSLT 2012 OLYMPICS task, including sub-sentence splitting, non-parallel sentence filtering, adoption of an optimized Chinese segmenter, and rule-based decoding constraints.

pdf bib abs
Chinese Characters Mapping Table of Japanese, Traditional Chinese and Simplified Chinese
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Chinese characters are used in both Japanese and Chinese, where they are called Kanji and Hanzi respectively. Because Chinese characters contain significant semantic information, a mapping table between Kanji and Hanzi can be very useful for many Japanese-Chinese bilingual applications, such as machine translation and cross-lingual information retrieval. Because Kanji characters originated in ancient China, most Kanji have corresponding Chinese characters in Hanzi. However, the relation between Kanji and Hanzi is quite complicated. In this paper, we propose a method of making a Chinese characters mapping table of Japanese, Traditional Chinese and Simplified Chinese automatically by means of freely available resources. We define seven categories for Kanji based on the relation between Kanji and Hanzi, and classify mappings of Chinese characters into these categories. We use a resource from Wiktionary to show the completeness of the mapping table we made. Statistical comparison shows that our proposed method makes a more complete mapping table than the current version of Wiktionary.

2011

pdf bib
Extracting Paraphrases from Definition Sentences on the Web
Chikara Hashimoto | Kentaro Torisawa | Stijn De Saeger | Jun’ichi Kazama | Sadao Kurohashi
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Proceedings of the ACL-HLT 2011 System Demonstrations
Sadao Kurohashi
Proceedings of the ACL-HLT 2011 System Demonstrations

pdf bib
Generative Modeling of Coordination by Factoring Parallelism and Selectional Preferences
Daisuke Kawahara | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
A Discriminative Approach to Japanese Zero Anaphora Resolution with Large-scale Lexicalized Case Frames
Ryohei Sasano | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Bayesian Subtree Alignment Model based on Dependency Trees
Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Acquiring Strongly-related Events using Predicate-argument Co-occurring Statistics and Case Frames
Tomohide Shibata | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Relevance Feedback using Latent Information
Jun Harashima | Sadao Kurohashi
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Japanese-Chinese Phrase Alignment Using Common Chinese Characters Information
Chenhui Chu | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of Machine Translation Summit XIII: Papers

pdf bib
Efficient retrieval of tree translation examples for Syntax-Based Machine Translation
Fabien Cromieres | Sadao Kurohashi
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Non-parametric Bayesian Segmentation of Japanese Noun Phrases
Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

pdf bib abs
Online Japanese Unknown Morpheme Detection using Orthographic Variation
Yugo Murawaki | Sadao Kurohashi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

To solve the unknown morpheme problem in Japanese morphological analysis, we previously proposed a novel framework of online unknown morpheme acquisition and its implementation. This framework poses a previously unexplored problem, online unknown morpheme detection. Online unknown morpheme detection is a task of finding morphemes in each sentence that are not listed in a given lexicon. Unlike in English, it is a non-trivial task because Japanese does not delimit words by white space. We first present a baseline method that simply uses the output of the morphological analyzer. We then show that it fails to detect some unknown morphemes because they are over-segmented into shorter registered morphemes. To cope with this problem, we present a simple solution, the use of orthographic variation of Japanese. Under the assumption that orthographic variants behave similarly, each over-segmentation candidate is checked against its counterparts. Experiments show that the proposed method improves the recall of detection and contributes to improving unknown morpheme acquisition.

pdf bib abs
Acquiring Reliable Predicate-argument Structures from Raw Corpora for Case Frame Compilation
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a method for acquiring reliable predicate-argument structures from raw corpora for automatic compilation of case frames. Such lexicon compilation requires highly reliable predicate-argument structures to practically contribute to Natural Language Processing (NLP) applications, such as paraphrasing, text entailment, and machine translation. However, to precisely identify predicate-argument structures, case frames are required. This issue is similar to the question "what came first: the chicken or the egg?" In this paper, we propose the first step in the extraction of reliable predicate-argument structures without using case frames. We first apply chunking to raw corpora and then extract reliable chunks to ensure that high-quality predicate-argument structures are obtained from the chunks. We conducted experiments to confirm the effectiveness of our approach. We successfully extracted reliable chunks with an accuracy of 98% and high-quality predicate-argument structures with an accuracy of 97%. Our experiments confirmed that we succeeded in acquiring highly reliable predicate-argument structures that can be used to compile case frames.

pdf bib
Identifying Contradictory and Contrastive Relations between Statements to Outline Web Information on a Given Topic
Daisuke Kawahara | Kentaro Inui | Sadao Kurohashi
Coling 2010: Posters

pdf bib
Semantic Classification of Automatically Acquired Nouns using Lexico-Syntactic Clues
Yugo Murawaki | Sadao Kurohashi
Coling 2010: Posters

pdf bib
Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables
Tetsuji Nakagawa | Kentaro Inui | Sadao Kurohashi
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Using Smaller Constituents Rather Than Sentences in Active Learning for Japanese Dependency Parsing
Manabu Sassano | Sadao Kurohashi
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)
Sadao Kurohashi | Takehito Utsuro
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)

pdf bib
Exploiting Term Importance Categories and Dependency Relations for Natural Language Search
Keiji Shinzato | Sadao Kurohashi
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)

pdf bib
Summarizing Search Results using PLSI
Jun Harashima | Sadao Kurohashi
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)

2009

pdf bib
The Effect of Corpus Size on Case Frame Acquisition for Discourse Analysis
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
An Alignment Algorithm Using Belief Propagation and a Structure-Based Distortion Model
Fabien Cromières | Sadao Kurohashi
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
A Unified Single Scan Algorithm for Japanese Base Phrase Chunking and Dependency Parsing
Manabu Sassano | Sadao Kurohashi
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

pdf bib
A Probabilistic Model for Associative Anaphora Resolution
Ryohei Sasano | Sadao Kurohashi
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Statistical Phrase Alignment Model Using Dependency Relation Probability
Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009

pdf bib
Bottom-up Named Entity Recognition using Two-stage Machine Learning Method
Hirotaka Funayama | Tomohide Shibata | Sadao Kurohashi
Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications (MWE 2009)

pdf bib
Capturing Consistency between Intra-clause and Inter-clause Relations in Knowledge-rich Dependency and Case Structure Analysis
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 11th International Conference on Parsing Technologies (IWPT’09)

2008

pdf bib abs
A Large-Scale Web Data Collection as a Natural Language Processing Infrastructure
Keiji Shinzato | Daisuke Kawahara | Chikara Hashimoto | Sadao Kurohashi
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In recent years, language resources acquired from the Web have been released, and these data improve the performance of applications in several NLP tasks. Although language resources based on the web page unit are useful in NLP tasks and applications such as knowledge acquisition, document retrieval and document summarization, such language resources have not been released so far. In this paper, we propose a data format for results of web page processing, and a search engine infrastructure which makes it possible to share approximately 100 million Japanese web pages. By obtaining the web data, NLP researchers can begin their own processing immediately without analyzing web pages by themselves.

pdf bib
Coordination Disambiguation without Any Similarities
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
A Fully-Lexicalized Probabilistic Model for Japanese Zero Anaphora Resolution
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Chinese Dependency Parsing with Large Scale Automatically Constructed Case Structures
Kun Yu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Online Acquisition of Japanese Unknown Morphemes using Morphological Constraints
Yugo Murawaki | Sadao Kurohashi
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
Blog Categorization Exploiting Domain Dictionary and Dynamically Estimated Domains of Unknown Words
Chikara Hashimoto | Sadao Kurohashi
Proceedings of ACL-08: HLT, Short Papers

pdf bib
TSUBAKI: An Open Search Engine Infrastructure for Developing New Information Access Methodology
Keiji Shinzato | Tomohide Shibata | Daisuke Kawahara | Chikara Hashimoto | Sadao Kurohashi
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Japanese Named Entity Recognition Using Structural Natural Language Processing
Ryohei Sasano | Sadao Kurohashi
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf bib
SYNGRAPH: A Flexible Matching Method based on Synonymous Expression Extraction from an Ordinary Dictionary and a Web Corpus
Tomohide Shibata | Michitaka Odani | Jun Harashima | Takashi Oonishi | Sadao Kurohashi
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf bib abs
Linguistically-motivated Tree-based Probabilistic Phrase Alignment
Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

In this paper, we propose a probabilistic phrase alignment model based on dependency trees. This model is linguistically motivated, using syntactic information during the alignment process. The main advantage of this model is that the linguistic difference between source and target languages is successfully absorbed. It is composed of two models: Model1 uses content word translation probability and function word translation probability; Model2 uses dependency relation probability, which is defined for a pair of positional relations on dependency trees. Relation probability acts as a tree-based phrase reordering model. Since this model is directed, we combine two alignment results from bi-directional training by symmetrization heuristics to get the definitive alignment. We conduct experiments on a Japanese-English corpus, and achieve reasonably high alignment quality compared with a word-based alignment model.

2007

pdf bib
Construction of Domain Dictionary for Fundamental Vocabulary
Chikara Hashimoto | Sadao Kurohashi
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf bib
A Three-Step Deterministic Parser for Chinese Dependency Parsing
Kun Yu | Sadao Kurohashi | Hao Liu
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

pdf bib
Structural phrase alignment based on consistency criteria
Toshiaki Nakazawa | Yu Kun | Sadao Kurohashi
Proceedings of Machine Translation Summit XI: Papers

pdf bib
Probabilistic Coordination Disambiguation in a Fully-Lexicalized Japanese Parser
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2006

pdf bib abs
Case Frame Compilation from the Web using High-Performance Computing
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Case frames are important knowledge for a variety of NLP systems, especially when wide-coverage case frames are available. To acquire such large-scale case frames, it is necessary to automatically compile them from an enormous corpus. In this paper, we consider the web as a corpus. We first build a huge text corpus from the web, and then construct case frames from the corpus. It is infeasible to do these processes on one CPU, and thus we employ a high-performance computing environment composed of 350 CPUs. The acquired corpus consists of 470M sentences, and the case frames compiled from them have 90,000 verb entries. The case frames contain most examples of usual use, and are ready to be applied to many NLP analyses and applications.

pdf bib
A Fully-Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis
Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf bib
Example-based machine translation based on deeper NLP
Toshiaki Nakazawa | Kun Yu | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the Third International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib
Chinese Word Segmentation and Named Entity Recognition by Character Tagging
Kun Yu | Sadao Kurohashi | Hao Liu | Toshiaki Nakazawa
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

pdf bib
Unsupervised Topic Identification by Integrating Linguistic and Visual Information Based on Hidden Markov Models
Tomohide Shibata | Sadao Kurohashi
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

pdf bib
Example-based Machine Translation Pursuing Fully Structural NLP
Sadao Kurohashi | Toshiaki Nakazawa | Kauffmann Alexis | Daisuke Kawahara
Proceedings of the Second International Workshop on Spoken Language Translation

pdf bib abs
Probabilistic Model for Example-based Machine Translation
Eiji Aramaki | Sadao Kurohashi | Hideki Kashioka | Naoto Kato
Proceedings of Machine Translation Summit X: Papers

Example-based machine translation (EBMT) systems, so far, rely on heuristic measures in retrieving translation examples. Such a heuristic measure costs time to adjust, and might make its algorithm unclear. This paper presents a probabilistic model for EBMT. Under the proposed model, the system searches for the translation example combination which has the highest probability. The proposed model clearly formalizes the EBMT process. In addition, the model can naturally incorporate the context similarity of translation examples. The experimental results demonstrate that the proposed model has a slightly better translation quality than state-of-the-art EBMT systems.

pdf bib
PP-Attachment Disambiguation Boosted by a Gigantic Volume of Unambiguous Examples
Daisuke Kawahara | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

pdf bib
Automatic Acquisition of Basic Katakana Lexicon from a Given Corpus
Toshiaki Nakazawa | Daisuke Kawahara | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

pdf bib
Automatic Slide Generation Based on Discourse Structure Analysis
Tomohide Shibata | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

pdf bib
Lexical Choice via Topic Adaptation for Paraphrasing Written Language to Spoken Language
Nobuhiro Kaji | Sadao Kurohashi
Second International Joint Conference on Natural Language Processing: Full Papers

2004

pdf bib
Paraphrasing Predicates from Written Language to Spoken Language Using the Web
Nobuhiro Kaji | Masashi Okamoto | Sadao Kurohashi
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

pdf bib
Improving Japanese Zero Pronoun Resolution by Global Word Sense Disambiguation
Daisuke Kawahara | Sadao Kurohashi
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Automatic Construction of Nominal Case Frames and its Application to Indirect Anaphora Resolution
Ryohei Sasano | Daisuke Kawahara | Sadao Kurohashi
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
Example-based machine translation using structural translation examples
Eiji Aramaki | Sadao Kurohashi
Proceedings of the First International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib abs
Toward Text Understanding: Integrating Relevance-tagged Corpus and Automatically Constructed Case Frames
Daisuke Kawahara | Ryohei Sasano | Sadao Kurohashi
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

This paper proposes a wide-range anaphora resolution system toward text understanding. This system resolves zero, direct and indirect anaphors in Japanese texts by integrating two sorts of linguistic resources: a hand-annotated corpus with various relations and automatically constructed case frames. The corpus has relevance tags which consist of predicate-argument relations, relations between nouns and coreferences, and is utilized for learning parameters of the system and testing it. The case frames are indispensable knowledge both for detecting zero/indirect anaphors and estimating appropriate antecedents. Our preliminary experiments showed promising results.

This paper describes a system for finding phrasal translation correspondences from a parallel parsed corpus, that is, a collection of paired English and Japanese sentences. First, the system finds phrasal correspondences by consulting a Japanese-English translation dictionary. Then, the system finds correspondences among the remaining phrases by using the sentences' dependency structures and the balance of all correspondences. The method is based on the assumption that in a parallel corpus most fragments in a source sentence have corresponding fragments in a target sentence.

2000

pdf bib
Dialogue Helpsystem based on Flexible Matching of User Query with Natural Language Knowledge Base
Sadao Kurohashi | Wataru Higasa
1st SIGdial Workshop on Discourse and Dialogue

pdf bib
Nonlocal Language Modeling based on Context Co-occurrence Vectors
Sadao Kurohashi | Manabu Ori
2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

pdf bib
Discourse Structure Analysis for News Video
Yasuhiko Watanabe | Yoshihiro Okada | Sadao Kurohashi | Eiichi Iwanari
Proceedings of the COLING-2000 Workshop on Semantic Annotation and Intelligent Content

pdf bib
Japanese Case Structure Analysis
Daisuke Kawahara | Nobuhiro Kaji | Sadao Kurohashi
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf bib
Finding Structural Correspondences from Bilingual Parsed Corpus for Corpus-based Translation
Hideo Watanabe | Sadao Kurohashi | Eiji Aramaki
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

pdf bib
A Parallel English-Japanese Query Collection for the Evaluation of On-Line Help Systems
Richard F. E. Sutcliffe | Sadao Kurohashi
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

1999

pdf bib
Semantic Analysis of Japanese Noun Phrases - A New Approach to Dictionary-Based Understanding
Sadao Kurohashi | Yasuyuki Sakai
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

pdf bib
Construction of Japanese Nominal Semantic Dictionary using “A NO B” Phrases in Corpora
Sadao Kurohashi | Masaki Murata | Yasunori Yata | Mitsunobu Shimada | Makoto Nagao
The Computational Treatment of Nominals

pdf bib
General Word Sense Disambiguation Method Based on a Full Sentential Context
Jiri Stetina | Sadao Kurohashi | Makoto Nagao
Usage of WordNet in Natural Language Processing Systems

1995

pdf bib abs
Analyzing Coordinate Structures Including Punctuation in English
Sadao Kurohashi
Proceedings of the Fourth International Workshop on Parsing Technologies

We present a method of identifying coordinate structure scopes and determining usages of commas in sentences at the same time. All possible interpretations concerning comma usages and coordinate structure scopes are ranked by taking advantage of parallelism between conjoined phrases/clauses/sentences and calculating their similarity scores. We evaluated this method through experiments on held-out test sentences and obtained promising results: both the success ratio of interpreting commas and that of detecting CS scopes were about 80%.

1994

pdf bib
A Syntactic Analysis Method of Long Japanese Sentences Based on the Detection of Conjunctive Structures
Sadao Kurohashi | Makoto Nagao
Computational Linguistics, Volume 20, Number 4, December 1994

pdf bib
Automatic Detection of Discourse Structure by Checking Surface Information in Sentences
Sadao Kurohashi | Makoto Nagao
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics

1993

pdf bib abs
Structural Disambiguation in Japanese by Evaluating Case Structures based on Examples in a Case Frame Dictionary
Sadao Kurohashi | Makoto Nagao
Proceedings of the Third International Workshop on Parsing Technologies

A case structure expression is one of the most important forms to represent the meaning of a sentence. Case structure analysis is usually performed by consulting case frame information in verb dictionaries and by selecting a proper case frame for an input sentence. However, this analysis is very difficult because of word sense ambiguity and structural ambiguity. A conventional method for solving these problems is to use the method of selectional restriction, but this method has a drawback in the semantic marker (SM) system – the trade-off between descriptive power and construction cost. This paper describes a method of case structure analysis of Japanese sentences which overcomes the drawback in the SM system, concentrating on the structural disambiguation. This method selects a proper case frame for an input by the similarity measure between the input and typical example sentences of each case frame. When there are two or more possible readings for an input because of structural ambiguity, the best reading will be selected by evaluating case structures in each possible reading by the similarity measure with typical example sentences of case frames.