Yuji Matsumoto

Also published as: Yūji Matsumoto

2023

CovRelex-SE: Adding Semantic Information for Relation Search via Sequence Embedding
Truong Do | Chau Nguyen | Vu Tran | Ken Satoh | Yuji Matsumoto | Minh Nguyen
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

In recent years, COVID-19 has impacted all aspects of human life. As a result, numerous publications relating to this disease have been issued. Due to the massive volume of publications, some retrieval systems have been developed to provide researchers with useful information. In these systems, lexical searching methods are widely used, which raises many issues related to acronyms, synonyms, and rare keywords. In this paper, we present a hybrid relation retrieval system, CovRelex-SE, based on embeddings to provide high-quality search results. Our system can be accessed through the following URL: https://www.jaist.ac.jp/is/labs/nguyen-lab/systems/covrelex-se/

24-bit Languages
Yiran Wang | Taro Watanabe | Masao Utiyama | Yuji Matsumoto
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Unsupervised Paraphrasing of Multiword Expressions
Takashi Wada | Yuji Matsumoto | Timothy Baldwin | Jey Han Lau
Findings of the Association for Computational Linguistics: ACL 2023

We propose an unsupervised approach to paraphrasing multiword expressions (MWEs) in context. Our model employs only monolingual corpus data and pre-trained language models (without fine-tuning), and does not make use of any external resources such as dictionaries. We evaluate our method on the SemEval 2022 idiomatic semantic text similarity task, and show that it outperforms all unsupervised systems and rivals supervised systems.

2022

Unsupervised Lexical Substitution with Decontextualised Embeddings
Takashi Wada | Timothy Baldwin | Yuji Matsumoto | Jey Han Lau
Proceedings of the 29th International Conference on Computational Linguistics

We propose a new unsupervised method for lexical substitution using pre-trained language models. Compared to previous approaches that use the generative capability of language models to predict substitutes, our method retrieves substitutes based on the similarity of contextualised and decontextualised word embeddings, i.e. the average contextual representation of a word in multiple contexts. We conduct experiments in English and Italian, and show that our method substantially outperforms strong baselines and establishes a new state-of-the-art without any explicit supervision or fine-tuning. We further show that our method performs particularly well at predicting low-frequency substitutes, and also generates a diverse list of substitute candidates, reducing morphophonetic or morphosyntactic biases induced by article-noun agreement.

Coordination Generation via Synchronized Text-Infilling
Hiroki Teranishi | Yuji Matsumoto
Proceedings of the 29th International Conference on Computational Linguistics

Generating synthetic data for supervised learning from large-scale pre-trained language models has enhanced performance across several NLP tasks, especially in low-resource scenarios. In particular, many studies of data augmentation employ masked language models to replace words with other words in a sentence. However, most of them are evaluated on sentence classification tasks and cannot immediately be applied to tasks related to the sentence structure. In this paper, we propose a simple yet effective approach to generating sentences with a coordinate structure in which the boundaries of its conjuncts are explicitly specified. For a given span in a sentence, our method embeds a mask with a coordinating conjunction in two ways (“X and [mask]”, “[mask] and X”) and forces masked language models to fill the two blanks with identical text. To achieve this, we introduce decoding methods for BERT and T5 models with the constraint that predictions for different masks are synchronized. Furthermore, we develop a training framework that effectively selects synthetic examples for the supervised coordination disambiguation task. We demonstrate that our method produces promising coordination instances that provide gains for the task in low-resource settings.

Improving Discriminative Learning for Zero-Shot Relation Extraction
Van-Hien Tran | Hiroki Ouchi | Taro Watanabe | Yuji Matsumoto
Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge

Zero-shot relation extraction (ZSRE) aims to predict target relations that cannot be observed during training. While most previous studies have focused on fully supervised relation extraction and achieved considerably high performance, less effort has been made towards ZSRE. This study proposes a new model incorporating discriminative embedding learning for both sentences and semantic relations. In addition, a self-adaptive comparator network is used to judge whether the relationship between a sentence and a relation is consistent. Experimental results on two benchmark datasets showed that the proposed method significantly outperforms the state-of-the-art methods.

Global Entity Disambiguation with BERT
Ikuya Yamada | Koki Washio | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a global entity disambiguation (ED) model based on BERT. To capture global contextual information for ED, our model treats not only words but also entities as input tokens, and solves the task by sequentially resolving mentions to their referent entities and using resolved entities as inputs at each step. We train the model using a large entity-annotated corpus obtained from Wikipedia. We achieve new state-of-the-art results on five standard ED datasets: AIDA-CoNLL, MSNBC, AQUAINT, ACE2004, and WNED-WIKI. The source code and model checkpoint are available at https://github.com/studio-ousia/luke.

Out-of-Domain Discourse Dependency Parsing via Bootstrapping: An Empirical Analysis on Its Effectiveness and Limitation
Noriki Nishida | Yuji Matsumoto
Transactions of the Association for Computational Linguistics, Volume 10

Discourse parsing has been studied for decades. However, it still remains challenging to utilize discourse parsing for real-world applications because the parsing accuracy degrades significantly on out-of-domain text. In this paper, we report and discuss the effectiveness and limitations of bootstrapping methods for adapting modern BERT-based discourse dependency parsers to out-of-domain text without relying on additional human supervision. Specifically, we investigate self-training, co-training, tri-training, and asymmetric tri-training of graph-based and transition-based discourse dependency parsing models, as well as confidence measures and sample selection criteria in two adaptation scenarios: monologue adaptation between scientific disciplines and dialogue genre adaptation. We also release COVID-19 Discourse Dependency Treebank (COVID19-DTB), a new manually annotated resource for discourse dependency parsing of biomedical paper abstracts. The experimental results show that bootstrapping is significantly and consistently effective for unsupervised domain adaptation of discourse dependency parsing, but the low coverage of accurately predicted pseudo labels is a bottleneck for further improvement. We show that active learning can mitigate this limitation.

2021

Dependency Patterns of Complex Sentences and Semantic Disambiguation for Abstract Meaning Representation Parsing
Yuki Yamamoto | Yuji Matsumoto | Taro Watanabe
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Abstract Meaning Representation (AMR) is a sentence-level meaning representation based on predicate argument structure. One of the challenges we find in AMR parsing is to capture the structure of complex sentences which expresses the relation between predicates. Knowing the core part of the sentence structure in advance may be beneficial in such a task. In this paper, we present a list of dependency patterns for English complex sentence constructions designed for AMR parsing. With a dedicated pattern matcher, all occurrences of complex sentence constructions are retrieved from an input sentence. Since some of the subordinators are semantically ambiguous, we deal with this problem by training classification models on data derived from AMR and a Wikipedia corpus, establishing a new baseline for future work. The developed complex sentence patterns and the corresponding AMR descriptions will be made public.

Structured Refinement for Sequential Labeling
Yiran Wang | Hiroyuki Shindo | Yuji Matsumoto | Taro Watanabe
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
Takashi Wada | Tomoharu Iwata | Yuji Matsumoto | Timothy Baldwin | Jey Han Lau
Proceedings of the 1st Workshop on Multilingual Representation Learning

We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations even in extremely low-resource conditions. Furthermore, our model also works well on high-resource conditions, achieving state-of-the-art performance on a German-English word-alignment task.

Removing Word-Level Spurious Alignment between Images and Pseudo-Captions in Unsupervised Image Captioning
Ukyo Honda | Yoshitaka Ushiku | Atsushi Hashimoto | Taro Watanabe | Yuji Matsumoto
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Unsupervised image captioning is a challenging task that aims at generating captions without the supervision of image-sentence pairs, but only with images and sentences drawn from different sources and object labels detected from the images. In previous work, pseudo-captions, i.e., sentences that contain the detected object labels, were assigned to a given image. The focus of the previous work was on the alignment of input images and pseudo-captions at the sentence level. However, pseudo-captions contain many words that are irrelevant to a given image. In this work, we investigate the effect of removing mismatched words from image-sentence alignment to determine how they make this task difficult. We propose a simple gating mechanism that is trained to align image features with only the most reliable words in pseudo-captions: the detected object labels. The experimental results show that our proposed method outperforms the previous methods without introducing complex sentence-level learning objectives. Combined with the sentence-level alignment method of previous work, our method further improves its performance. These results confirm the importance of careful alignment in word-level details.

This paper presents CovRelex, a scientific paper retrieval system targeting entities and relations via relation extraction on COVID-19 scientific papers. This work aims at building a system supporting users efficiently in acquiring knowledge across a huge number of COVID-19 scientific papers published rapidly. Our system can be accessed via https://www.jaist.ac.jp/is/labs/nguyen-lab/systems/covrelex/.

Nested Named Entity Recognition via Explicitly Excluding the Influence of the Best Path
Yiran Wang | Hiroyuki Shindo | Yuji Matsumoto | Taro Watanabe
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper presents a novel method for nested named entity recognition. As a layered method, our method extends the prior second-best path recognition method by explicitly excluding the influence of the best path. Our method maintains a set of hidden states at each time step and selectively leverages them to build a different potential function for recognition at each level. In addition, we demonstrate that recognizing innermost entities first results in better performance than the conventional outermost entities first scheme. We provide extensive experimental results on ACE2004, ACE2005, and GENIA datasets to show the effectiveness and efficiency of our proposed method.

2020

Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies
Gosse Bouma | Yuji Matsumoto | Stephan Oepen | Kenji Sagae | Djamé Seddah | Weiwei Sun | Anders Søgaard | Reut Tsarfaty | Dan Zeman
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
Ikuya Yamada | Akari Asai | Hiroyuki Shindo | Hideaki Takeda | Yuji Matsumoto
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.

Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia
Ikuya Yamada | Akari Asai | Jin Sakuma | Hiroyuki Shindo | Hideaki Takeda | Yoshiyasu Takefuji | Yuji Matsumoto
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at https://wikipedia2vec.github.io/.

We propose a simple method for nominal coordination boundary identification. As the main strength of our method, it can identify the coordination boundaries without training on labeled data, and can be applied even if coordination structure annotations are not available. Our system employs pre-trained word embeddings to measure the similarities of words and detects the span of coordination, assuming that conjuncts share syntactic and semantic similarities. We demonstrate that our method yields good results in identifying coordinated noun phrases in the GENIA corpus and is comparable to a recent supervised method for the case when the coordinator conjoins simple noun phrases.

2019

Playing by the Book: An Interactive Game Approach for Action Graph Extraction from Text
Ronen Tamari | Hiroyuki Shindo | Dafna Shahaf | Yuji Matsumoto
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

Understanding procedural text requires tracking entities, actions and effects as the narrative unfolds. We focus on the challenging real-world problem of action-graph extraction from materials science papers, where language is highly specialized and data annotation is expensive and scarce. We propose a novel approach, Text2Quest, where procedural text is interpreted as instructions for an interactive game. A learning agent completes the game by executing the procedure correctly in a text-based simulated lab environment. The framework can complement existing approaches and enables richer forms of learning compared to static texts. We discuss potential limitations and advantages of the approach, and release a prototype proof-of-concept, hoping to encourage research in this direction.

Stochastic Tokenization with a Language Model for Neural Text Classification
Tatsuya Hiraoka | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

For unsegmented languages such as Japanese and Chinese, tokenization of a sentence has a significant impact on the performance of text classification. Sentences are usually segmented with words or subwords by a morphological analyzer or byte pair encoding and then encoded with word (or subword) representations for neural networks. However, segmentation is potentially ambiguous, and it is unclear whether the segmented tokens achieve the best performance for the target task. In this paper, we propose a method to simultaneously learn tokenization and text classification to address these problems. Our model incorporates a language model for unsupervised tokenization into a text classifier and then trains both models simultaneously. To make the model robust against infrequent tokens, we sampled segmentation for each sentence stochastically during training, which resulted in improved performance of text classification. We conducted experiments on sentiment analysis as a text classification task and show that our method achieves better performance than previous methods.

Unsupervised Multilingual Word Embedding with Limited Resources using Neural Language Models
Takashi Wada | Tomoharu Iwata | Yuji Matsumoto
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recently, a variety of unsupervised methods have been proposed that map pre-trained word embeddings of different languages into the same space without any parallel data. These methods aim to find a linear transformation based on the assumption that monolingual word embeddings are approximately isomorphic between languages. However, it has been demonstrated that this assumption holds true only under specific conditions, and with limited resources, the performance of these methods decreases drastically. To overcome this problem, we propose a new unsupervised multilingual embedding method that does not rely on such an assumption and performs well under resource-poor scenarios, namely when only a small amount of monolingual data (i.e., 50k sentences) is available, or when the domains of monolingual data are different across languages. Our proposed model, which we call ‘Multilingual Neural Language Models’, shares some of the network parameters among multiple languages, and encodes sentences of multiple languages into the same space. The model jointly learns word embeddings of different languages in the same space, and generates multilingual embeddings without any parallel data or pre-training. Our experiments on word alignment tasks have demonstrated that, under the low-resource condition, our model substantially outperforms existing unsupervised and even supervised methods trained with 500 bilingual pairs of words. Our model also outperforms unsupervised methods given different-domain corpora across languages. Our code is publicly available.

Relation Classification Using Segment-Level Attention-based CNN and Dependency-based RNN
Van-Hien Tran | Van-Thuy Phi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Recently, relation classification has gained much success by exploiting deep neural networks. In this paper, we propose a new model effectively combining Segment-level Attention-based Convolutional Neural Networks (SACNNs) and Dependency-based Recurrent Neural Networks (DepRNNs). While SACNNs allow the model to selectively focus on the important information segment from the raw sequence, DepRNNs help to handle the long-distance relations from the shortest dependency path of relation entities. Experiments on the SemEval-2010 Task 8 dataset show that our model is comparable to the state-of-the-art without using any external lexical features.

Decomposed Local Models for Coordinate Structure Parsing
Hiroki Teranishi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a simple and accurate model for coordination boundary identification. Our model decomposes the task into three sub-tasks during training: finding a coordinator, identifying the inside boundaries of a pair of conjuncts, and selecting the outside boundaries of that pair. For inference, we make use of the probabilities of coordinators and conjuncts in CKY parsing to find the optimal combination of coordinate structures. Experimental results demonstrate that our model achieves state-of-the-art results, ensuring that the global structure of coordinations is consistent.

2018

Cooperating Tools for MWE Lexicon Management and Corpus Annotation
Yuji Matsumoto | Akihiko Kato | Hiroyuki Shindo | Toshio Morita
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

We present tools for lexicon and corpus management that offer cooperating functionality in corpus annotation. The former, named Cradle, stores a set of words and expressions where multi-word expressions are defined with their own part-of-speech information and internal syntactic structures. The latter, named ChaKi, manages text corpora with part-of-speech (POS) and syntactic dependency structure annotations. Those two tools cooperate so that the words and multi-word expressions stored in Cradle are directly referred to by ChaKi in conducting corpus annotation, and the words and expressions annotated in ChaKi can be output as a list of lexical entities that are to be stored in Cradle.

This paper discusses the representation of coordinate structures in the Universal Dependencies framework for two head-final languages, Japanese and Korean. UD applies a strict principle that makes the head of coordination the left-most conjunct. However, the guideline may produce syntactic trees which are difficult to accept in head-final languages. This paper describes the status in the current Japanese and Korean corpora and proposes alternative designs suitable for these languages.

A Span Selection Model for Semantic Role Labeling
Hiroki Ouchi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We present a simple and accurate span-based model for semantic role labeling (SRL). Our model directly takes into account all possible argument spans and scores them for each label. At decoding time, we greedily select higher scoring labeled spans. One advantage of our model is that it allows us to design and use span-level features that are difficult to use in token-based BIO tagging approaches. Experimental results demonstrate that our ensemble model achieves state-of-the-art results, 87.4 F1 and 87.0 F1 on the CoNLL-2005 and 2012 datasets, respectively.

A Parallel Corpus of Arabic-Japanese News Articles
Go Inoue | Nizar Habash | Yuji Matsumoto | Hiroyuki Aoyama
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

PDFAnno: a Web-based Linguistic Annotation Tool for PDF Documents
Hiroyuki Shindo | Yohei Munesada | Yuji Matsumoto
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Chemical Compounds Knowledge Visualization with Natural Language Processing and Linked Data
Kazunari Tanaka | Tomoya Iwakura | Yusuke Koyanagi | Noriko Ikeda | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Construction of Large-scale English Verbal Multiword Expression Annotated Corpus
Akihiko Kato | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

EMTC: Multilabel Corpus in Movie Domain for Emotion Analysis in Conversational Text
Duc-Anh Phan | Yuji Matsumoto
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Automatic Error Correction on Japanese Functional Expressions Using Character-based Neural Machine Translation
Jun Liu | Fei Cheng | Yiran Wang | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Reduction of Parameter Redundancy in Biaffine Classifiers with Symmetric and Circulant Weight Matrices
Tomoki Matsuno | Katsuhiko Hayashi | Takahiro Ishihara | Hitoshi Manabe | Yuji Matsumoto
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation

Ranking-Based Automatic Seed Selection and Noise Reduction for Weakly Supervised Relation Extraction
Van-Thuy Phi | Joan Santoso | Masashi Shimbo | Yuji Matsumoto
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This paper addresses the tasks of automatic seed selection for bootstrapping relation extraction, and noise reduction for distantly supervised relation extraction. We first point out that these tasks are related. Then, inspired by ranking relation instances and patterns computed by the HITS algorithm, and selecting cluster centroids using the K-means, LSA, or NMF method, we propose methods for selecting the initial seeds from an existing resource, or reducing the level of noise in the distantly labeled data. Experiments show that our proposed methods achieve a better performance than the baseline systems in both tasks.

Sentence Suggestion of Japanese Functional Expressions for Chinese-speaking Learners
Jun Liu | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of ACL 2018, System Demonstrations

We present a computer-assisted learning system, Jastudy, which is particularly designed for Chinese-speaking learners of Japanese as a second language (JSL) to learn Japanese functional expressions with suggestion of appropriate example sentences. The system automatically recognizes Japanese functional expressions using a free Japanese morphological analyzer MeCab, which is retrained on a new Conditional Random Fields (CRF) model. In order to select appropriate example sentences, we apply a pairwise-based machine learning tool, Support Vector Machine for Ranking (SVMrank) to estimate the complexity of the example sentences using Japanese–Chinese homographs as an important feature. In addition, we cluster the example sentences that contain Japanese functional expressions with two or more meanings and usages, based on part-of-speech, conjugation forms of verbs and semantic attributes, using the K-means clustering algorithm in Scikit-Learn. Experimental results demonstrate the effectiveness of our approach.

Dynamic Feature Selection with Attention in Incremental Parsing
Ryosuke Kohita | Hiroshi Noji | Yuji Matsumoto
Proceedings of the 27th International Conference on Computational Linguistics

One main challenge for incremental transition-based parsers, when future inputs are invisible, is to extract good features from a limited local context. In this work, we present a simple technique to maximally utilize the local features with an attention mechanism, which works as context-dependent dynamic feature selection. Our model learns, for example, which tokens a parser should focus on to decide the next action. Our multilingual experiment shows its effectiveness across many languages. We also present an experiment with an augmented test dataset and demonstrate that it helps to understand the model’s behavior on locally ambiguous points.

2017

Improving Sequence to Sequence Neural Machine Translation by Utilizing Syntactic Dependency Information
An Nguyen Le | Ander Martinez | Akifumi Yoshimoto | Yuji Matsumoto
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Sequence to Sequence Neural Machine Translation has achieved significant performance in recent years. Yet, there are some existing issues that Neural Machine Translation still does not solve completely. Two of them are the translation of long sentences and “over-translation”. To address these two problems, we propose an approach that utilizes more grammatical information, such as syntactic dependencies, so that the output can be generated based on more abundant information. In our approach, syntactic dependencies are employed in decoding. In addition, the output of the model is presented not as a simple sequence of tokens but as a linearized tree construction. In order to assess the performance, we construct a model based on an attention-based encoder-decoder model in which the source language is input to the encoder as a sequence and the decoder generates the target language as a linearized dependency tree structure. Experiments on the Europarl-v7 dataset of French-to-English translation demonstrate that our proposed method improves BLEU scores by 1.57 and 2.40 on datasets consisting of sentences with up to 50 and 80 tokens, respectively. Furthermore, the proposed method also solved the two existing problems, ineffective translation for long sentences and over-translation in Neural Machine Translation.

pdf abs
Coordination Boundary Identification with Similarity and Replaceability
Hiroki Teranishi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We propose a neural network model for coordination boundary detection. Our method relies on the two common properties - similarity and replaceability in conjuncts - in order to detect both similar pairs of conjuncts and dissimilar pairs of conjuncts. The model improves identification of clause-level coordination using bidirectional RNNs incorporating two properties as features. We show that our model outperforms the existing state-of-the-art methods on the coordination annotated Penn Treebank and Genia corpus without any syntactic information from parsers.

pdf abs
Segment-Level Neural Conditional Random Fields for Named Entity Recognition
Motoki Sato | Hiroyuki Shindo | Ikuya Yamada | Yuji Matsumoto
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We present Segment-level Neural CRF, which combines neural networks with a linear chain CRF for segment-level sequence modeling tasks such as named entity recognition (NER) and syntactic chunking. Our segment-level CRF can consider higher-order label dependencies compared with conventional word-level CRF. Since it is difficult to consider all possible variable-length segments, our method uses a segment lattice constructed from the word-level tagging model to reduce the search space. Performing experiments on NER and chunking, we demonstrate that our method outperforms conventional word-level CRF with neural networks.

pdf abs
Can Discourse Relations be Identified Incrementally?
Frances Yung | Hiroshi Noji | Yuji Matsumoto
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Humans process language word by word and construct partial linguistic structures on the fly before the end of the sentence is perceived. Inspired by this cognitive ability, incremental algorithms for natural language processing tasks have been proposed and demonstrated promising performance. For discourse relation (DR) parsing, however, it is not yet clear to what extent humans can recognize DRs incrementally, because the latent ‘nodes’ of discourse structure can span clauses and sentences. To answer this question, this work investigates incrementality in discourse processing based on a corpus annotated with DR signals. We find that DRs are dominantly signaled at the boundary between the two constituent discourse units. The findings complement existing psycholinguistic theories on expectation in discourse processing and provide direction for incremental discourse parsing.

pdf abs
Improving Neural Text Normalization with Data Augmentation at Character- and Morphological Levels
Itsumi Saito | Jun Suzuki | Kyosuke Nishida | Kugatsu Sadamitsu | Satoshi Kobashikawa | Ryo Masumura | Yuji Matsumoto | Junji Tomita
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

In this study, we investigated the effectiveness of augmented data for encoder-decoder-based neural normalization models. Attention-based encoder-decoder models are greatly effective in generating many natural languages. In general, we have to prepare a large amount of training data to train an encoder-decoder model. Unlike machine translation, there is little training data for text-normalization tasks. In this paper, we propose two methods for generating augmented data. The experimental results with Japanese dialect normalization indicate that our methods are effective for an encoder-decoder model and achieve higher BLEU scores than the baselines. We also investigated the oracle performance and revealed that there is sufficient room for improving an encoder-decoder model.

pdf bib abs
Multilingual Back-and-Forth Conversion between Content and Function Head for Easy Dependency Parsing
Ryosuke Kohita | Hiroshi Noji | Yuji Matsumoto
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Universal Dependencies (UD) is becoming a standard annotation scheme cross-linguistically, but it is argued that this scheme centering on content words is harder to parse than the conventional one centering on function words. To improve the parsability of UD, we propose a back-and-forth conversion algorithm, in which we preprocess the training treebank to increase parsability, and reconvert the parser outputs to follow the UD scheme as a postprocess. We show that this technique consistently improves LAS across languages even with a state-of-the-art parser, in particular on core dependency arcs such as nominal modifier. We also provide an in-depth analysis to understand why our method increases parsability.

pdf abs
Joint Prediction of Morphosyntactic Categories for Fine-Grained Arabic Part-of-Speech Tagging Exploiting Tag Dictionary Information
Go Inoue | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Part-of-speech (POS) tagging for morphologically rich languages such as Arabic is a challenging problem because of their enormous tag sets. One reason for this is that in the tagging scheme for such languages, a complete POS tag is formed by combining tags from multiple tag sets defined for each morphosyntactic category. Previous approaches in Arabic POS tagging applied one model for each morphosyntactic tagging task, without utilizing shared information between the tasks. In this paper, we propose an approach that utilizes this information by jointly modeling multiple morphosyntactic tagging tasks with a multi-task learning framework. We also propose a method of incorporating tag dictionary information into our neural models by combining word representations with representations of the sets of possible tags. Our experiments showed that the joint model with tag dictionary information results in an accuracy of 91.38% on the Penn Arabic Treebank data set, with an absolute improvement of 2.11% over the current state-of-the-art tagger.

pdf abs
Adversarial Training for Cross-Domain Universal Dependency Parsing
Motoki Sato | Hitoshi Manabe | Hiroshi Noji | Yuji Matsumoto
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We describe our submission to the CoNLL 2017 shared task, which exploits the shared common knowledge of a language across different domains via a domain adaptation technique. Our approach is an extension to the recently proposed adversarial training technique for domain adaptation, which we apply on top of a graph-based neural dependency parsing model on bidirectional LSTMs. In our experiments, we find our baseline graph-based parser already outperforms the official baseline model (UDPipe) by a large margin. Further, by applying our technique to the treebanks of the same language with different domains, we observe an additional gain in the performance, in particular for the domains with less training data.

pdf
Sentence Complexity Estimation for Chinese-speaking Learners of Japanese
Jun Liu | Yuji Matsumoto
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation

pdf abs
A* CCG Parsing with a Supertag and Dependency Factored Model
Masashi Yoshikawa | Hiroshi Noji | Yuji Matsumoto
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a new A* CCG parsing model in which the probability of a tree is decomposed into factors of CCG categories and its syntactic dependencies both defined on bi-directional LSTMs. Our factored model allows the precomputation of all probabilities and runs very efficiently, while modeling sentence structures explicitly via dependencies. Our model achieves the state-of-the-art results on English and Japanese CCG parsing.

pdf abs
Neural Modeling of Multi-Predicate Interactions for Japanese Predicate Argument Structure Analysis
Hiroki Ouchi | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The performance of Japanese predicate argument structure (PAS) analysis has improved in recent years thanks to the joint modeling of interactions between multiple predicates. However, this approach relies heavily on syntactic information predicted by parsers, and suffers from error propagation. To remedy this problem, we introduce a model that uses grid-type recurrent neural networks. The proposed model automatically induces features sensitive to multi-predicate interactions from the word sequence information of a sentence. Experiments on the NAIST Text Corpus demonstrate that without syntactic information, our model outperforms previous syntax-dependent models.

pdf abs
English Multiword Expression-aware Dependency Parsing Including Named Entities
Akihiko Kato | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Because syntactic structures and spans of multiword expressions (MWEs) are independently annotated in many English syntactic corpora, they are generally inconsistent with respect to one another, which is harmful to the implementation of an aggregate system. In this work, we construct a corpus that ensures consistency between dependency structures and MWEs, including named entities. Further, we explore models that predict both MWE-spans and an MWE-aware dependency structure. Experimental results show that our joint model using additional MWE-span features achieves an MWE recognition improvement of 1.35 points over a pipeline model.

pdf abs
Effective Online Reordering with Arc-Eager Transitions
Ryosuke Kohita | Hiroshi Noji | Yuji Matsumoto
Proceedings of the 15th International Conference on Parsing Technologies

We present a new transition system with word reordering for unrestricted non-projective dependency parsing. Our system is based on decomposed arc-eager rather than arc-standard, which allows more flexible ambiguity resolution between a local projective and non-local crossing attachment. In our experiment on Universal Dependencies 2.0, we find our parser outperforms the ordinary swap-based parser particularly on languages with a large amount of non-projectivity.

2016

pdf abs
Identification of Flexible Multiword Expressions with the Help of Dependency Structure Annotation
Ayaka Morimoto | Akifumi Yoshimoto | Akihiko Kato | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the Workshop on Grammar and Lexicon: interactions and interfaces (GramLex)

This paper presents our ongoing work on the compilation of an English multi-word expression (MWE) lexicon. We are especially interested in collecting flexible MWEs, in which some other components can intervene in the expression, such as “a number of” vs “a large number of”, where a modifier of “number” can be placed in the expression and inherit the original meaning. We first collect possible candidates of flexible English MWEs from the web, and annotate all of their occurrences in the Wall Street Journal portion of the OntoNotes corpus. We make use of word dependency structure information of the sentences converted from the phrase structure annotation. This process enables semi-automatic annotation of MWEs in the corpus and simultaneously produces the internal and external dependency representation of flexible MWEs.

pdf abs
Japanese Text Normalization with Encoder-Decoder Model
Taishi Ikeda | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)

Text normalization is the task of transforming lexical variants to their canonical forms. We model the problem of text normalization as a character-level sequence to sequence learning problem and present a neural encoder-decoder model for solving it. To train the encoder-decoder model, many sentence pairs are generally required. However, Japanese non-standard canonical pairs are scarce in the form of parallel corpora. To address this issue, we propose a method of data augmentation to increase data size by converting existing resources into synthesized non-standard forms using handcrafted rules. We conducted an experiment to demonstrate that the synthesized corpus contributes to stably training an encoder-decoder model and improves the performance of Japanese text normalization.

pdf abs
Global Pre-ordering for Improving Sublanguage Translation
Masaru Fuji | Masao Utiyama | Eiichiro Sumita | Yuji Matsumoto
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

When translating formal documents, capturing the sentence structure specific to the sublanguage is extremely necessary to obtain high-quality translations. This paper proposes a novel global reordering method with particular focus on long-distance reordering for capturing the global sentence structure of a sublanguage. The proposed method learns global reordering models from a non-annotated parallel corpus and works in conjunction with conventional syntactic reordering. Experimental results on the patent abstract sublanguage show substantial gains of more than 25 points in the RIBES metric and comparable BLEU scores both for Japanese-to-English and English-to-Japanese translations.

pdf bib abs
Simplification of Example Sentences for Learners of Japanese Functional Expressions
Jun Liu | Yuji Matsumoto
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)

Learning functional expressions is one of the difficulties for language learners, since functional expressions tend to have multiple meanings and complicated usages in various situations. In this paper, we report an experiment of simplifying example sentences of Japanese functional expressions especially for Chinese-speaking learners. For this purpose, we developed “Japanese Functional Expressions List” and “Simple Japanese Replacement List”. To evaluate the method, we conduct a small-scale experiment with Chinese-speaking learners on the effectiveness of the simplified example sentences. The experimental results indicate that simplified sentences are helpful in learning Japanese functional expressions.

pdf abs
Japanese Lexical Simplification for Non-Native Speakers
Muhaimin Hading | Yuji Matsumoto | Maki Sakamoto
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)

This paper introduces Japanese lexical simplification. Japanese lexical simplification is the task of replacing difficult words in a given sentence to produce a new sentence with simple words without changing the original meaning of the sentence. We propose a method of supervised regression learning to estimate the difficulty ordering of words with statistical features obtained from two types of Japanese corpora. For the similarity of words, we use a Japanese thesaurus and dependency-based word embeddings. Evaluation of the proposed method is performed by comparing the difficulty ordering of the words.

pdf abs
BCCWJ-DepPara: A Syntactic Annotation Treebank on the ‘Balanced Corpus of Contemporary Written Japanese’
Masayuki Asahara | Yuji Matsumoto
Proceedings of the 12th Workshop on Asian Language Resources (ALR12)

Paratactic syntactic structures are difficult to represent in syntactic dependency tree structures. As such, we propose an annotation schema for syntactic dependency annotation of Japanese, in which coordinate structures are split from and overlaid on bunsetsu-based (base phrase unit) dependency. The schema represents nested coordinate structures, non-constituent conjuncts, and forward sharing as the set of regions. The annotation was performed on the core data of ‘Balanced Corpus of Contemporary Written Japanese’, which comprised about one million words and 1980 samples from six registers, such as newspapers, books, magazines, and web texts.

pdf
Joint Transition-based Dependency Parsing and Disfluency Detection for Automatic Speech Recognition Texts
Masashi Yoshikawa | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf
Modelling the Usage of Discourse Connectives as Rational Speech Acts
Frances Yung | Kevin Duh | Taku Komura | Yuji Matsumoto
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

pdf bib
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Yuji Matsumoto | Rashmi Prasad
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

pdf abs
Demonstration of ChaKi.NET – beyond the corpus search system
Masayuki Asahara | Yuji Matsumoto | Toshio Morita
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

ChaKi.NET is a corpus management system for dependency structure annotated corpora. After more than 10 years of continuous development, the system is now usable not only for corpus search, but also for visualization, annotation, labelling, and formatting for statistical analysis. This paper describes the various functions included in the current ChaKi.NET system.

pdf abs
Improving Neural Machine Translation on resource-limited pairs using auxiliary data of a third language
Ander Martinez | Yuji Matsumoto
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

In recent years, interest in Deep Neural Networks (DNN) has grown in the field of Natural Language Processing, as new training methods have been proposed. The usage of DNN has achieved state-of-the-art performance in various areas. Neural Machine Translation (NMT) described by Bahdanau et al. (2014) and its successive variations have shown promising results. DNNs, however, tend to over-fit on small data-sets, which makes this method impracticable for resource-limited language pairs. This article combines three different ideas (splitting words into smaller units, using an extra dataset of a related language pair and using monolingual data) for improving the performance of NMT models on language pairs with limited data. Our experiments show that, in some cases, our proposed approach to subword-units performs better than BPE (Byte pair encoding) and that auxiliary language-pairs and monolingual data can help improve the performance of languages with limited resources.

pdf
Discriminative Reranking for Grammatical Error Correction with Statistical Machine Translation
Tomoya Mizumoto | Yuji Matsumoto
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Modelling the Interpretation of Discourse Connectives by Bayesian Pragmatics
Frances Yung | Kevin Duh | Taku Komura | Yuji Matsumoto
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
A Generalized Framework for Hierarchical Word Sequence Language Model
Xiaoyi Wu | Kevin Duh | Yuji Matsumoto
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

pdf
Multiple Emotions Detection in Conversation Transcripts
Duc-Anh Phan | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

pdf
Integrating Word Embedding Offsets into the Espresso System for Part-Whole Relation Extraction
Van-Thuy Phi | Yuji Matsumoto
Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation: Oral Papers

We present an attempt to port the international syntactic annotation scheme, Universal Dependencies, to the Japanese language in this paper. Since the Japanese syntactic structure is usually annotated on the basis of unique chunk-based dependencies, we first introduce word-based dependencies by using a word unit called the Short Unit Word, which usually corresponds to an entry in the lexicon UniDic. Porting is done by mapping the part-of-speech tagset in UniDic to the universal part-of-speech tagset, and converting a constituent-based treebank to a typed dependency tree. The conversion is not straightforward, and we discuss the problems that arose in the conversion and the current solutions. A treebank consisting of 10,000 sentences was built by converting the existent resources and currently released to the public.

pdf abs
Construction of an English Dependency Corpus incorporating Compound Function Words
Akihiko Kato | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The recognition of multiword expressions (MWEs) in a sentence is important for such linguistic analyses as syntactic and semantic parsing, because it is known that combining an MWE into a single token improves accuracy for various NLP tasks, such as dependency parsing and constituency parsing. However, MWEs are not annotated in Penn Treebank. Furthermore, when converting word-based dependency to MWE-aware dependency directly, one could combine nodes in an MWE into a single node. Nevertheless, this method often leads to the following problem: A node derived from an MWE could have multiple heads and the whole dependency structure including MWE might be cyclic. Therefore we converted a phrase structure to a dependency structure after establishing an MWE as a single subtree. This approach can avoid an occurrence of multiple heads and/or cycles. In this way, we constructed an English dependency corpus taking into account compound function words, which are one type of MWEs that serve as functional expressions. In addition, we report experimental results of dependency parsing using a constructed corpus.

2015

pdf bib
Patent claim translation based on sublanguage-specific sentence structure
Masaru Fuji | Atsushi Fujita | Masao Utiyama | Eiichiro Sumita | Yuji Matsumoto
Proceedings of Machine Translation Summit XV: Papers

pdf
Joint Case Argument Identification for Japanese Predicate Argument Structure Analysis
Hiroki Ouchi | Hiroyuki Shindo | Kevin Duh | Yuji Matsumoto
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf
Synthetic Word Parsing Improves Chinese Word Segmentation
Fei Cheng | Kevin Duh | Yuji Matsumoto
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf
Semantic Structure Analysis of Noun Phrases using Abstract Meaning Representation
Yuichiro Sawai | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf
Coordination-Aware Dependency Parsing (Preliminary Report)
Akifumi Yoshimoto | Kazuo Hara | Masashi Shimbo | Yuji Matsumoto
Proceedings of the 14th International Conference on Parsing Technologies

pdf
CKY Parsing with Independence Constraints
Joseph Irwin | Yuji Matsumoto
Proceedings of the 14th International Conference on Parsing Technologies

pdf
Crosslingual Annotation and Analysis of Implicit Discourse Connectives for Machine Translation
Frances Yung | Kevin Duh | Yuji Matsumoto
Proceedings of the Second Workshop on Discourse in Machine Translation

pdf bib
Sequential Annotation and Chunking of Chinese Discourse Structure
Frances Yung | Kevin Duh | Yuji Matsumoto
Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

pdf bib
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications
Hsin-Hsi Chen | Yuen-Hsien Tseng | Yuji Matsumoto | Lung Hsiang Wong
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications

pdf
Collocation Assistant for Learners of Japanese as a Second Language
Lis Pereira | Yuji Matsumoto
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications

pdf
Grammatical Error Correction Considering Multi-word Expressions
Tomoya Mizumoto | Masato Mita | Yuji Matsumoto
Proceedings of the 2nd Workshop on Natural Language Processing Techniques for Educational Applications

pdf bib
Keynote Lecture 1: Scientific Paper Analysis
Yuji Matsumoto
Proceedings of the 12th International Conference on Natural Language Processing

pdf
An Improved Hierarchical Word Sequence Language Model Using Directional Information
Xiaoyi Wu | Yuji Matsumoto
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation

pdf
An Efficient Annotation for Phrasal Verbs using Dependency Information
Masayuki Komai | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters

2014

pdf
A Hierarchical Word Sequence Language Model
Xiaoyi Wu | Yuji Matsumoto
Proceedings of the 28th Pacific Asia Conference on Language, Information and Computing

pdf
Identifying collocations using cross-lingual association measures
Lis Pereira | Elga Strafella | Kevin Duh | Yuji Matsumoto
Proceedings of the 10th Workshop on Multiword Expressions (MWE)

pdf abs
Collocation or Free Combination? — Applying Machine Translation Techniques to identify collocations in Japanese
Lis Pereira | Elga Strafella | Yuji Matsumoto
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This work presents an initial investigation on how to distinguish collocations from free combinations. The assumption is that, while free combinations can be literally translated, the overall meaning of collocations is different from the sum of the translation of its parts. Based on that, we verify whether a machine translation system can help us perform such distinction. Results show that it improves the precision compared with standard methods of collocation identification through statistical association measures.

pdf abs
Parsing Chinese Synthetic Words with a Character-based Dependency Model
Fei Cheng | Kevin Duh | Yuji Matsumoto
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Synthetic word analysis is a potentially important but relatively unexplored problem in Chinese natural language processing. Two issues with the conventional pipeline methods involving word segmentation are (1) the lack of a common segmentation standard and (2) the poor segmentation performance on OOV words. These issues may be circumvented if we adopt the view of character-based parsing, providing both internal structures to synthetic words and global structure to sentences in a seamless fashion. However, the accuracy of synthetic word parsing is not yet satisfactory, due to the lack of research. In view of this, we propose and present experiments on several synthetic word parsers. Additionally, we demonstrate the usefulness of incorporating large unlabelled corpora and a dictionary for this task. Our parsers significantly outperform the baseline (a pipeline method).

pdf
Improving Dependency Parsers with Supertags
Hiroki Ouchi | Kevin Duh | Yuji Matsumoto
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

pdf
Analysis and Prediction of Unalignable Words in Parallel Text
Frances Yung | Kevin Duh | Yuji Matsumoto
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

2013

pdf
What Information is Helpful for Dependency Based Semantic Role Labeling
Yanyan Luo | Kevin Duh | Yuji Matsumoto
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf
Towards Automatic Error Type Classification of Japanese Language Learners’ Writings
Hiromi Oyama | Mamoru Komachi | Yuji Matsumoto
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)

pdf
Modeling and Learning Semantic Co-Compositionality through Prototype Projections and Neural Networks
Masashi Tsubaki | Kevin Duh | Masashi Shimbo | Yuji Matsumoto
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf abs
Efficient Stacked Dependency Parsing by Forest Reranking
Katsuhiko Hayashi | Shuhei Kondo | Yuji Matsumoto
Transactions of the Association for Computational Linguistics, Volume 1

This paper proposes a discriminative forest reranking algorithm for dependency parsing that can be seen as a form of efficient stacked parsing. A dynamic programming shift-reduce parser produces a packed derivation forest which is then scored by a discriminative reranker, using the 1-best tree output by the shift-reduce parser as guide features in addition to third-order graph-based features. To improve efficiency and accuracy, this paper also proposes a novel shift-reduce parser that eliminates the spurious ambiguity of arc-standard transition systems. Testing on the English Penn Treebank data, forest reranking gave a state-of-the-art unlabeled dependency accuracy of 93.12.

pdf
A Learner Corpus-based Approach to Verb Suggestion for ESL
Yu Sawai | Mamoru Komachi | Yuji Matsumoto
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Automated Collocation Suggestion for Japanese Second Language Learners
Lis Pereira | Erlyn Manguilimotan | Yuji Matsumoto
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

pdf
NAIST at the NLI 2013 Shared Task
Tomoya Mizumoto | Yuta Hayashibe | Keisuke Sakaguchi | Mamoru Komachi | Yuji Matsumoto
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf
Hidden Markov Tree Model for Word Alignment
Shuhei Kondo | Kevin Duh | Yuji Matsumoto
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf
Topic Models + Word Alignment = A Flexible Framework for Extracting Bilingual Dictionary from Comparable Corpus
Xiaodong Liu | Kevin Duh | Yuji Matsumoto
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

pdf
A Hybrid Chinese Spelling Correction Using Language Model and Statistical Machine Translation with Reranking
Xiaodong Liu | Kevin Cheng | Yanyan Luo | Kevin Duh | Yuji Matsumoto
Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing

2012

pdf
Walk-based Computation of Contextual Word Similarity
Kazuo Hara | Ikumi Suzuki | Masashi Shimbo | Yuji Matsumoto
Proceedings of COLING 2012

pdf
Joint English Spelling Error Correction and POS Tagging for Language Learners Writing
Keisuke Sakaguchi | Tomoya Mizumoto | Mamoru Komachi | Yuji Matsumoto
Proceedings of COLING 2012

pdf
The Effect of Learner Corpus Size in Grammatical Error Correction of ESL Writings
Tomoya Mizumoto | Yuta Hayashibe | Mamoru Komachi | Masaaki Nagata | Yuji Matsumoto
Proceedings of COLING 2012: Posters

pdf
Things between Lexicon and Grammar
Yuji Matsumoto
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation

pdf
Head-driven Transition-based Parsing with Top-down Prediction
Katsuhiko Hayashi | Taro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Tense and Aspect Error Correction for ESL Learners Using Global Context
Toshikazu Tajiri | Mamoru Komachi | Yuji Matsumoto
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf abs
UniDic for Early Middle Japanese: a Dictionary for Morphological Analysis of Classical Japanese
Toshinobu Ogiso | Mamoru Komachi | Yasuharu Den | Yuji Matsumoto
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In order to construct an annotated diachronic corpus of Japanese, we propose to create a new dictionary for morphological analysis of Early Middle Japanese (Classical Japanese) based on UniDic, a dictionary for Contemporary Japanese. Differences between the Early Middle Japanese and Contemporary Japanese, which prevent a naïve adaptation of UniDic to Early Middle Japanese, are found at the levels of lexicon, morphology, grammar, orthography and pronunciation. In order to overcome these problems, we extended dictionary entries and created a training corpus of Early Middle Japanese to adapt UniDic for Contemporary Japanese to Early Middle Japanese. Experimental results show that the proposed UniDic-EMJ, a new dictionary for Early Middle Japanese, achieves as high accuracy (97%) as needed for the linguistic research on lexicon and grammar in Japanese classical text analysis.

2011

pdf
Using the Mutual k-Nearest Neighbor Graphs for Semi-supervised Classification on Natural Language Data
Kohei Ozaki | Masashi Shimbo | Mamoru Komachi | Yuji Matsumoto
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

pdf
Narrative Schema as World Knowledge for Coreference Resolution
Joseph Irwin | Mamoru Komachi | Yuji Matsumoto
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

pdf
Different Input Systems for Different Devices
Asad Habib | Masakazu Iwatate | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011)

pdf
Error Correcting Romaji-kana Conversion for Japanese Language Education
Seiji Kasahara | Mamoru Komachi | Masaaki Nagata | Yuji Matsumoto
Proceedings of the Workshop on Advances in Text Input Methods (WTIM 2011)

pdf bib
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
Dekang Lin | Yuji Matsumoto | Rada Mihalcea
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
HITS-based Seed Selection and Stop List Construction for Bootstrapping
Tetsuo Kiso | Masashi Shimbo | Mamoru Komachi | Yuji Matsumoto
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Multilayer Sequence Labeling
Ai Azuma | Yuji Matsumoto
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf
Third-order Variational Reranking on Packed-Shared Dependency Forests
Katsuhiko Hayashi | Taro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf
Dependency-based Analysis for Tagalog Sentences
Erlyn Manguilimotan | Yuji Matsumoto
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

pdf
Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners
Tomoya Mizumoto | Mamoru Komachi | Masaaki Nagata | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf
Japanese Predicate Argument Structure Analysis Exploiting Argument Position and Type
Yuta Hayashibe | Mamoru Komachi | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf
Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature
Teruaki Oka | Mamoru Komachi | Toshinobu Ogiso | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf
Jointly Extracting Japanese Predicate-Argument Relation with Markov Logic
Katsumasa Yoshikawa | Masayuki Asahara | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf
A Structured Model for Joint Learning of Argument Roles and Predicate Senses
Yotaro Watanabe | Masayuki Asahara | Yuji Matsumoto
Proceedings of the ACL 2010 Conference Short Papers

pdf abs
Annotating Event Mentions in Text with Modality, Focus, and Source Information
Suguru Matsuyoshi | Megumi Eguchi | Chitose Sao | Koji Murakami | Kentaro Inui | Yuji Matsumoto
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Many natural language processing tasks, including information extraction, question answering and recognizing textual entailment, require analysis of the polarity, focus of polarity, tense, aspect, mood and source of the event mentions in a text in addition to its predicate-argument structure analysis. We refer to modality, polarity and other associated information as extended modality. In this paper, we propose a new annotation scheme for representing the extended modality of event mentions in a sentence. Our extended modality consists of the following seven components: Source, Time, Conditional, Primary modality type, Actuality, Evaluation and Focus. We reviewed the literature about extended modality in Linguistics and Natural Language Processing (NLP) and defined appropriate labels of each component. In the proposed annotation scheme, information of extended modality of an event mention is summarized at the core predicate of the event mention for immediate use in NLP applications. We also report on the current progress of our manual annotation of a Japanese corpus of about 50,000 event mentions, showing a reasonably high ratio of inter-annotator agreement.

Large scale annotated corpora are very important not only in linguistic research but also in practical natural language processing tasks since a number of practical tools such as Part-of-speech (POS) taggers and syntactic parsers are now corpus-based or machine learning-based systems which require some amount of accurately annotated corpora. This article presents an annotated corpus management tool that provides various functions that include flexible search, statistic calculation, and error correction for linguistically annotated corpora. The target of annotation covers POS tags, base phrase chunks and syntactic dependency structures. This tool aims at helping development of consistent construction of lexicon and annotated corpora to be used by researchers both in linguists and language processing communities.

pdf abs
Augmenting a Semantic Verb Lexicon with a Large Scale Collection of Example Sentences
Kentaro Inui | Toru Hirano | Ryu Iida | Atsushi Fujita | Yuji Matsumoto
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

One of the crucial issues in semantic parsing is how to reduce costs of collecting a sufficiently large amount of labeled data. This paper presents a new approach to cost-saving annotation of example sentences with predicate-argument structure information, taking Japanese as a target language. In this scheme, a large collection of unlabeled examples are first clustered and selectively sampled, and for each sampled cluster, only one representative example is given a label by a human annotator. The advantages of this approach are empirically supported by the results of our preliminary experiments, where we use an existing similarity function and naive sampling strategy.

2005

pdf
Chinese Word Segmentation by Classification of Characters
Chooi-Ling Goh | Masayuki Asahara | Yuji Matsumoto
International Journal of Computational Linguistics & Chinese Language Processing, Volume 10, Number 3, September 2005: Special Issue on Selected Papers from ROCLING XVI

pdf
Automatic Extraction of Fixed Multiword Expressions
Campbell Hore | Masayuki Asahara | Yūji Matsumoto
Second International Joint Conference on Natural Language Processing: Full Papers

pdf
Building a Japanese-Chinese Dictionary Using Kanji/Hanzi Conversion
Chooi-Ling Goh | Masayuki Asahara | Yuji Matsumoto
Second International Joint Conference on Natural Language Processing: Full Papers

pdf
Exploiting Lexical Conceptual Structure for Paraphrase Generation
Atsushi Fujita | Kentaro Inui | Yuji Matsumoto
Second International Joint Conference on Natural Language Processing: Full Papers

pdf
Opinion Extraction Using a Learning-Based Anaphora Resolution Technique
Nozomi Kobayashi | Ryu Iida | Kentaro Inui | Yuji Matsumoto
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf
Chinese Deterministic Dependency Analyzer: Examining Effects of Global Features and Root Node Finder
Yuchang Cheng | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing

2004

pdf
Japanese Unknown Word Identification by Character-based Chunking
Masayuki Asahara | Yuji Matsumoto
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf
Trajectory Based Word Sense Disambiguation
Xiaojie Wang | Yuji Matsumoto
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf
Building a Paraphrase Corpus for Speech Translation
Mitsuo Shimohata | Eiichiro Sumita | Yuji Matsumoto
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)

pdf bib
Paraphrasing of Japanese Light-verb Constructions Based on Lexical Conceptual Structure
Atsushi Fujita | Kentaro Furihata | Kentaro Inui | Yuji Matsumoto | Koichi Takeuchi
Proceedings of the Workshop on Multiword Expressions: Integrating Processing

pdf
Chinese Word Segmentation by Classification of Characters
Chooi-Ling Goh | Masayuki Asahara | Yuji Matsumoto
Proceedings of the Third SIGHAN Workshop on Chinese Language Processing

pdf
Modeling Category Structures with a Kernel Function
Hiroya Takamura | Yuji Matsumoto | Hiroyasu Yamada
Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004

pdf
Applying Conditional Random Fields to Japanese Morphological Analysis
Taku Kudo | Kaoru Yamamoto | Yuji Matsumoto
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

pdf
A Boosting Algorithm for Classification of Semi-Structured Text
Taku Kudo | Yuji Matsumoto
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

pdf bib
Machine Learning based NLP : Experiences and Supporting Tools
Yuji Matsumoto
Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation

pdf
Japanese Subjects and Information Structure : A Constraint-based Approach
Akira Ohtani | Yuji Matsumoto
Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation

pdf
Pruning False Unknown Words to Improve Chinese Word Segmentation
Chooi-Ling Goh | Masayuki Asahara | Yuji Matsumoto
Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation

pdf
Method for retrieving a similar sentence and its application to machine translation
Mitsuo Shimohata | Eiichiro Sumita | Yuji Matsumoto
Proceedings of the 10th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

2003

pdf
Retrieving Meaning-equivalent Sentences for Example-based Rough Translation
Mitsuo Shimohata | Eiichiro Sumita | Yuji Matsumoto
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

pdf
Learning Sequence-to-Sequence Correspondences from Parallel Corpora via Sequential Pattern Mining
Kaoru Yamamoto | Taku Kudo | Yuta Tsuboi | Yuji Matsumoto
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond

pdf
Feature Selection in Categorizing Procedural Expressions
Mineki Takechi | Takenobu Tokunaga | Yuji Matsumoto | Hozumi Tanaka
Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages

pdf
Protein Name Tagging for Biomedical Annotation in Text
Kaoru Yamamoto | Taku Kudo | Akihiko Konagaya | Yuji Matsumoto
Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine

pdf
Combining Segmenter and Chunker for Chinese Word Segmentation
Masayuki Asahara | Chooi Ling Goh | Xiaojie Wang | Yuji Matsumoto
Proceedings of the Second SIGHAN Workshop on Chinese Language Processing

pdf
Incorporating Contextual Cues in Trainable Models for Coreference Resolution
Ryu Iida | Kentaro Inui | Hiroya Takamura | Yuji Matsumoto
Proceedings of the 2003 EACL Workshop on The Computational Treatment of Anaphora

pdf abs
Statistical Dependency Analysis with Support Vector Machines
Hiroyasu Yamada | Yuji Matsumoto
Proceedings of the Eighth International Conference on Parsing Technologies

In this paper, we propose a method for analyzing word-word dependencies in a deterministic bottom-up manner using Support Vector Machines. We experimented with dependency trees converted from Penn Treebank data, and achieved over 90% accuracy of word-word dependency. Though the result is a little worse than the most up-to-date phrase structure based parsers, it looks satisfactorily accurate considering that our parser uses no information from phrase structures.

pdf
Automatic Construction of Machine Translation Knowledge Using Translation Literalness
Kenji Imamura | Eiichiro Sumita | Yuji Matsumoto
10th Conference of the European Chapter of the Association for Computational Linguistics

pdf
Fast Methods for Kernel-Based Text Analysis
Taku Kudo | Yuji Matsumoto
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

pdf
Feedback Cleaning of Machine Translation Rules Using Automatic Evaluation
Kenji Imamura | Eiichiro Sumita | Yuji Matsumoto
Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics

pdf
Chinese Unknown Word Identification Using Character-based Tagging and Chunking
Chooi Ling Goh | Masayuki Asahara | Yuji Matsumoto
The Companion Volume to the Proceedings of 41st Annual Meeting of the Association for Computational Linguistics

pdf bib
Japanese Named Entity Extraction with Redundant Morphological Analysis
Masayuki Asahara | Yuji Matsumoto
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

pdf abs
Example-based rough translation for speech-to-speech translation
Mitsuo Shimohata | Eiichiro Sumita | Yuji Matsumoto
Proceedings of Machine Translation Summit IX: Papers

Example-based machine translation (EBMT) is a promising translation method for speech-to-speech translation (S2ST) because of its robustness. However, it has two problems in that the performance degrades when input sentences are long and when the style of the input sentences and that of the example corpus are different. This paper proposes example-based rough translation to overcome these two problems. The rough translation method relies on “meaning-equivalent sentences,” which share the main meaning with an input sentence despite missing some unimportant information. This method facilitates retrieval of meaning-equivalent sentences for long input sentences. The retrieval of meaning-equivalent sentences is based on content words, modality, and tense. This method also provides robustness against the style differences between the input sentence and the example corpus.