Hideki Nakayama

2021

SciXGen: A Scientific Paper Dataset for Context-Aware Text Generation
Hong Chen | Hiroya Takamura | Hideki Nakayama
Findings of the Association for Computational Linguistics: EMNLP 2021

Generating texts in scientific papers requires not only capturing the content contained within the given input but also frequently acquiring external information called context. We push forward scientific text generation by proposing a new task, namely context-aware text generation in the scientific domain, aiming at exploiting the contributions of context in generated texts. To this end, we present a novel, challenging, large-scale Scientific Paper Dataset for ConteXt-Aware Text Generation (SciXGen), consisting of 205,304 well-annotated papers with full references to widely used objects (e.g., tables, figures, algorithms) in a paper. Using state-of-the-art models, we comprehensively benchmark the efficacy of our newly constructed SciXGen dataset in generating descriptions and paragraphs. Our dataset and benchmarks will be made publicly available to facilitate research on scientific text generation.

This paper presents the results of the shared tasks from the 8th Workshop on Asian Translation (WAT2021). For WAT2021, 28 teams participated in the shared tasks and 24 teams submitted their translation results for the human evaluation. We also accepted 5 research papers. About 2,100 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

GraphPlan: Story Generation by Planning with Event Graph
Hong Chen | Raphael Shu | Hiroya Takamura | Hideki Nakayama
Proceedings of the 14th International Conference on Natural Language Generation

Story generation is a task that aims to automatically generate a meaningful story. This task is challenging because it requires high-level understanding of the semantic meaning of sentences and causality of story events. Naive sequence-to-sequence models generally fail to acquire such knowledge, as it is difficult to guarantee logical correctness in a text generation model without strategic planning. In this study, we focus on planning a sequence of events assisted by event graphs and use the events to guide the generator. Rather than using a sequence-to-sequence model to output a sequence, as in some existing works, we propose to generate an event sequence by walking on an event graph. The event graphs are built automatically based on the corpus. To evaluate the proposed approach, we incorporate human participation, both in event planning and story generation. Based on the large-scale human annotation results, our proposed approach has been shown to provide more logically correct event sequences and stories compared with previous approaches.
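As a rough illustration of the idea of planning events by walking on a graph built from a corpus, the following minimal sketch counts which event directly follows which, then greedily walks to the most frequent unvisited successor. The function names, the toy corpus, and the greedy walking rule are all assumptions made for illustration; the abstract does not specify how the paper constructs its graphs or chooses the next event.

```python
from collections import defaultdict


def build_event_graph(event_sequences):
    """Count how often one event directly follows another in the corpus."""
    graph = defaultdict(lambda: defaultdict(int))
    for seq in event_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            graph[prev][nxt] += 1
    return graph


def walk(graph, start, max_len=5):
    """Greedy walk (an assumed rule): move to the most frequent unvisited successor."""
    path = [start]
    while len(path) < max_len:
        candidates = {e: c for e, c in graph[path[-1]].items() if e not in path}
        if not candidates:
            break
        # sorted() makes tie-breaking deterministic (alphabetical order).
        path.append(max(sorted(candidates), key=candidates.get))
    return path


# Hypothetical toy corpus of event sequences.
corpus = [
    ["wake", "eat", "work"],
    ["wake", "eat", "sleep"],
    ["wake", "eat", "work"],
]
plan = walk(build_event_graph(corpus), "wake")
```

In a full system the planned event sequence would then be passed to a text generator as guidance; that step is omitted here.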

2020

A Visually-grounded First-person Dialogue Dataset with Verbal and Non-verbal Responses
Hisashi Kamezawa | Noriki Nishida | Nobuyuki Shimizu | Takashi Miyazaki | Hideki Nakayama
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

In real-world dialogue, first-person visual information about where the other speakers are and what they are paying attention to is crucial to understand their intentions. Non-verbal responses also play an important role in social interactions. In this paper, we propose a visually-grounded first-person dialogue (VFD) dataset with verbal and non-verbal responses. The VFD dataset provides manually annotated (1) first-person images of agents, (2) utterances of human speakers, (3) eye-gaze locations of the speakers, and (4) the agents’ verbal and non-verbal responses. We present experimental results obtained using the proposed VFD dataset and recent neural network models (e.g., BERT, ResNet). The results demonstrate that first-person vision helps neural network models correctly understand human intentions, and the production of non-verbal responses is a challenging task like that of verbal responses. Our dataset is publicly available.

This paper presents the results of the shared tasks from the 7th Workshop on Asian Translation (WAT2020). For WAT2020, 20 teams participated in the shared tasks and 14 teams submitted their translation results for the human evaluation. We also received 12 research paper submissions, of which 7 were accepted. About 500 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

Unsupervised Discourse Constituency Parsing Using Viterbi EM
Noriki Nishida | Hideki Nakayama
Transactions of the Association for Computational Linguistics, Volume 8

In this paper, we introduce an unsupervised discourse constituency parsing algorithm. We use Viterbi EM with a margin-based criterion to train a span-based discourse parser in an unsupervised manner. We also propose initialization methods for Viterbi training of discourse constituents based on our prior knowledge of text structures. Experimental results demonstrate that our unsupervised parser achieves comparable or even superior performance to fully supervised parsers. We also investigate discourse constituents that are learned by our method.

Supervised Visual Attention for Multimodal Neural Machine Translation
Tetsuro Nishihara | Akihiro Tamura | Takashi Ninomiya | Yutaro Omote | Hideki Nakayama
Proceedings of the 28th International Conference on Computational Linguistics

This paper proposes a supervised visual attention mechanism for multimodal neural machine translation (MNMT), trained with constraints based on manual alignments between words in a sentence and their corresponding regions of an image. The proposed visual attention mechanism captures the relationship between a word and an image region more precisely than a conventional visual attention mechanism trained through MNMT in an unsupervised manner. Our experiments on English-German and German-English translation tasks using the Multi30k dataset and on English-Japanese and Japanese-English translation tasks using the Flickr30k Entities JP dataset show that a Transformer-based MNMT model can be improved by incorporating our proposed supervised visual attention mechanism and that further improvements can be achieved by combining it with a supervised cross-lingual attention mechanism (up to +1.61 BLEU, +1.7 METEOR).

A Visually-Grounded Parallel Corpus with Phrase-to-Region Linking
Hideki Nakayama | Akihiro Tamura | Takashi Ninomiya
Proceedings of the 12th Language Resources and Evaluation Conference

Visually-grounded natural language processing has become an important research direction in the past few years. However, the majority of the available cross-modal resources (e.g., image-caption datasets) are built in English and cannot be directly utilized in multilingual or non-English scenarios. In this study, we present a novel multilingual multimodal corpus by extending the Flickr30k Entities image-caption dataset with Japanese translations, which we name Flickr30k Entities JP (F30kEnt-JP). To the best of our knowledge, this is the first multilingual image-caption dataset where the captions in the two languages are parallel and share annotations of many-to-many phrase-to-region linking. We believe that phrase-to-region as well as phrase-to-phrase supervision can play a vital role in fine-grained grounding of language and vision, and will promote many tasks such as multilingual image captioning and multimodal machine translation. To verify our dataset, we performed phrase localization experiments in both languages and investigated the effectiveness of our Japanese annotations as well as multilingual learning realized by our dataset.

Single Model Ensemble using Pseudo-Tags and Distinct Vectors
Ryosuke Kuwabara | Jun Suzuki | Hideki Nakayama
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Model ensemble techniques often increase task performance in neural networks; however, they require increased time, memory, and management effort. In this study, we propose a novel method that replicates the effects of a model ensemble with a single model. Our approach creates K-virtual models within a single parameter space using K-distinct pseudo-tags and K-distinct vectors. Experiments on text classification and sequence labeling tasks on several datasets demonstrate that our method emulates or outperforms a traditional model ensemble with 1/K-times fewer parameters.

2019

Generating Diverse Translations with Sentence Codes
Raphael Shu | Hideki Nakayama | Kyunghyun Cho
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Users of machine translation systems may desire to obtain multiple candidates translated in different ways. In this work, we attempt to obtain diverse translations by using sentence codes to condition the sentence generation. We describe two methods to extract the codes, either with or without the help of syntax information. For diverse generation, we sample multiple candidates, each of which is conditioned on a unique code. Experiments show that the sampled translations have much higher diversity scores when using reasonable sentence codes, while the translation quality remains on par with the baselines even under the strong constraint imposed by the codes. In qualitative analysis, we show that our method is able to generate paraphrase translations with drastically different structures. The proposed approach can be easily adopted in existing translation systems, as no modification to the model is required.

Enabling Real-time Neural IME with Incremental Vocabulary Selection
Jiali Yao | Raphael Shu | Xinjian Li | Katsutoshi Ohtsuki | Hideki Nakayama
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers)

An input method editor (IME) converts sequential alphabet key inputs to words in a target language. It is an indispensable service for billions of Asian users. Although the neural-based language model is extensively studied and shows promising results in sequence-to-sequence tasks, applying a neural-based language model to IME was not considered feasible due to high latency when converting words on user devices. In this work, we identify the bottleneck of neural IME decoding as the heavy softmax computation over a large vocabulary. We propose an approach that incrementally builds a subset vocabulary from the word lattice. Our approach always computes the probability with a selected subset vocabulary. When the selected vocabulary is updated, the stale probabilities in previous steps are fixed by recomputing the missing logits. Experiments on a Japanese IME benchmark show an over 50x speedup for the softmax computations compared to the baseline, reaching real-time speed even on a commodity CPU without losing conversion accuracy. The approach is potentially applicable to other incremental sequence-to-sequence decoding tasks such as real-time continuous speech recognition.

2018

Improving Beam Search by Removing Monotonic Constraint for Neural Machine Translation
Raphael Shu | Hideki Nakayama
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

To achieve high translation performance, neural machine translation models usually rely on the beam search algorithm for decoding sentences. The beam search finds good candidate translations by considering multiple hypotheses of translations simultaneously. However, as the algorithm produces hypotheses in a monotonic left-to-right order, a hypothesis cannot be revisited once it is discarded. We found that such monotonicity forces the algorithm to sacrifice some good decoding paths. To mitigate this problem, we relax the monotonic constraint of the beam search by maintaining all found hypotheses in a single priority queue and using a universal score function for hypothesis selection. The proposed algorithm allows discarded hypotheses to be recovered in a later step. Despite its simplicity, we show that the proposed decoding algorithm enhances the quality of selected hypotheses and improves the translations even for high-performance models in an English-Japanese translation task.
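The idea of keeping every hypothesis in one priority queue can be illustrated with a small best-first search. This is a simplified sketch, not the paper's algorithm: the toy model `TOY_MODEL`, the function names, and the scoring rule (average log-probability per token, used here so that hypotheses of different lengths are comparable) are all assumptions for illustration.

```python
import heapq
import math

EOS = "</s>"

# Hypothetical toy model: prefix (tuple of tokens) -> next-token probabilities.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}


def next_log_probs(prefix):
    """Return {token: log-prob}; prefixes unknown to the toy model end the sentence."""
    dist = TOY_MODEL.get(prefix, {EOS: 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}


def score(log_prob, length):
    """Assumed score function: average log-probability per token."""
    return log_prob / max(length, 1)


def single_queue_search(max_expansions=20, n_best=2):
    # heapq is a min-heap, so scores are stored negated.
    queue = [(0.0, 0.0, ())]  # (negated score, log-prob, prefix)
    finished = []
    for _ in range(max_expansions):
        if not queue or len(finished) >= n_best:
            break
        _, log_prob, prefix = heapq.heappop(queue)
        for tok, lp in next_log_probs(prefix).items():
            new_lp = log_prob + lp
            if tok == EOS:
                finished.append((new_lp, prefix))
            else:
                new_prefix = prefix + (tok,)
                entry = (-score(new_lp, len(new_prefix)), new_lp, new_prefix)
                heapq.heappush(queue, entry)
    # Best finished hypotheses first, ranked by total log-probability.
    return sorted(finished, reverse=True)


results = single_queue_search(n_best=3)
```

In this toy setting the prefix `("b",)` is passed over at first in favor of extensions of `("a",)`, but it stays in the queue and is expanded later, which is the kind of recovery a strictly left-to-right beam would not allow.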

Coherence Modeling Improves Implicit Discourse Relation Recognition
Noriki Nishida | Hideki Nakayama
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

The research described in this paper examines how to learn linguistic knowledge associated with discourse relations from unlabeled corpora. We introduce an unsupervised learning method on text coherence that could produce numerical representations that improve implicit discourse relation recognition in a semi-supervised manner. We also empirically examine two variants of coherence modeling: order-oriented and topic-oriented negative sampling, showing that, of the two, topic-oriented negative sampling tends to be more effective.

Augmenting Image Question Answering Dataset by Exploiting Image Captions
Masashi Yokota | Hideki Nakayama
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Incorporating Semantic Attention in Video Description Generation
Natsuda Laokulrat | Naoaki Okazaki | Hideki Nakayama
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Word Ordering as Unsupervised Learning Towards Syntactically Plausible Word Representations
Noriki Nishida | Hideki Nakayama
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The research question we explore in this study is how to obtain syntactically plausible word representations without using human annotations. Our underlying hypothesis is that word ordering tests, or linearizations, are suitable for learning syntactic knowledge about words. To verify this hypothesis, we develop a differentiable model called Word Ordering Network (WON) that explicitly learns to recover correct word order while implicitly acquiring word embeddings representing syntactic knowledge. We evaluate the word embeddings produced by the proposed method on downstream syntax-related tasks such as part-of-speech tagging and dependency parsing. The experimental results demonstrate that the WON consistently outperforms both order-insensitive and order-sensitive baselines on these tasks.

An Empirical Study of Adequate Vision Span for Attention-Based Neural Machine Translation
Raphael Shu | Hideki Nakayama
Proceedings of the First Workshop on Neural Machine Translation

Recently, the attention mechanism has played a key role in achieving high performance for Neural Machine Translation models. However, as it computes a score function for the encoder states in all positions at each decoding step, the attention model greatly increases the computational complexity. In this paper, we investigate the adequate vision span of attention models in the context of machine translation, by proposing a novel attention framework that is capable of reducing redundant score computation dynamically. The term “vision span” means a window of the encoder states considered by the attention model in one step. In our experiments, we found that the average window size of vision span can be reduced by over 50% with modest loss in accuracy on English-Japanese and German-English translation tasks.
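To make the notion of a vision span concrete, the sketch below computes attention over a window of encoder states rather than over all positions. It is only an illustration under simplifying assumptions: the window here is fixed and centered on a given position, scoring is a plain dot product, and the function name is hypothetical, whereas the paper describes a framework that reduces the span dynamically.

```python
import math


def windowed_attention(query, encoder_states, center, half_width):
    """Attend only to states in [center - half_width, center + half_width]."""
    lo = max(0, center - half_width)
    hi = min(len(encoder_states), center + half_width + 1)
    window = encoder_states[lo:hi]
    # Dot-product scores are computed for the window only, not all positions.
    scores = [sum(q * h for q, h in zip(query, state)) for state in window]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the states inside the window.
    context = [
        sum(w * state[d] for w, state in zip(weights, window))
        for d in range(len(query))
    ]
    return (lo, hi), weights, context


# Hypothetical 2-dimensional encoder states for a 6-token source sentence.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0], [2.0, 2.0]]
span, weights, context = windowed_attention([1.0, 0.0], states, center=2, half_width=1)
```

With `half_width=1` only three of the six encoder states are scored, so the cost of the score computation per decoding step scales with the window size rather than the sentence length.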

2016

Automatic video description generation has recently been getting attention after rapid advancement in image caption generation. Automatically generating a description for a video is more challenging than for an image due to the temporal dynamics of its frames. Most prior work has relied on recurrent neural networks (RNNs), and recently attentional mechanisms have also been applied to make the model learn to focus on some frames of the video while generating each word in a describing sentence. In this paper, we focus on a sequence-to-sequence approach with a temporal attention mechanism. We analyze and compare the results from different attention model configurations. By applying the temporal attention mechanism to the system, we achieve a METEOR score of 0.310 on the Microsoft Video Description dataset, which outperforms the previous state-of-the-art system.