Takehito Utsuro

2025

pdf bib abs
RAG based Question Answering of Korean Laws and Precedents
Kiho Seo | Takehito Utsuro
Proceedings of the Eighth Fact Extraction and VERification Workshop (FEVER)

We propose a method of improving the performance of question answering based on the interpretation of criminal law regulations in the Korean language by using large language models. In this study, we develop a system that accumulates legislative texts and case precedents related to criminal procedures published on the Internet.The system searches for relevant legal provisions and precedents related to the queryunder the RAG (Retrieval-Augmented Generation) framework.It generates accurate responses to questions by conducting reasoning through large language modelsbased on these relevant laws and precedents. As an application example of this system, it can be utilized to support decision makingin investigations and legal interpretation scenarios within the field of Korean criminal law.

pdf bib abs
BiMax: Bidirectional MaxSim Score for Document-Level Alignment
Xiaotian Wang | Takehito Utsuro | Masaaki Nagata
Findings of the Association for Computational Linguistics: EMNLP 2025

Document alignment is necessary for the hierarchical mining, which aligns documents across source and target languages within the same web domain. Several high-precision sentence embedding-based methods have been developed, such as TK-PERT and Optimal Transport (OT). However, given the massive scale of web mining data, both accuracy and speed must be considered.In this paper, we propose a cross-lingual Bidirectional Maxsim score (BiMax) for computing doc-to-doc similarity,to improve efficiency compared to the OT method.Consequently, on the WMT16 bilingual document alignment task,BiMax attains accuracy comparable to OT with an approximate 100-fold speed increase.Meanwhile, we also conduct a comprehensive analysis to investigate the performance of current state-of-the-art multilingual sentence embedding models.

pdf bib abs
Generating Financial News Articles from Factors of Stock Price Rise / Decline by LLMs
Shunsuke Nishida | Takehito Utsuro
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)

In this paper, we study the task of generating financial news articles related to stock price fluctuations. Traditionally, reporters manually write these articles by identifying the causes behind significant stock price volatility. However, this process is time-consuming, limiting the number of articles produced. To address this, the study explores the use of generative AI to automatically generate such articles. The AI system, similar to human reporters, would analyze stock price volatility and determine the underlying factors contributing to these fluctuations. To support this approach, we introduces a Japanese dataset called JFinSR, which includes stock price fluctuation rankings from “Kabutan” and related financial information regarding factors of stock price rise / decline from “Nihon Keizai Shimbun (Nikkei).” Using this dataset, we implement the few-shot learning technique on large language models (LLMs) to enable automatic generation of high-quality articles from factors of stock price rise / decline that are available in Nikkei. In the evaluation, we compare zero-shot and few-shot learning approaches, where the few-shot learning achieved the higher F1 scores in terms of ROUGE-1/ROUGE-L metrics.

pdf bib abs
A Dialogue System for Semi-Structured Interviews by LLMs and its Evaluation on Persona Information Collection
Ryo Hasegawa | Yijie Hua | Takehito Utsuro | Ekai Hashimoto | Mikio Nakano | Shun Shiramatsu
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology

In this paper, we propose a dialogue control management framework using large language models for semi-structured interviews. Specifically, large language models are used to generate the interviewer’s utterances and to make conditional branching decisions based on the understanding of the interviewee’s responses. The framework enables flexible dialogue control in interview conversations by generating and updating slots and values according to interviewee answers. More importantly, we invented through LLMs’ prompt tuning the framework of accumulating the list of slots generated along the course of incrementing the number of interviewees through the semi-structured interviews. Evaluation results showed that the proposed approach of accumulating the list of generated slots throughout the semi-structured interviews outperform the baseline without accumulating generated slots in terms of the number of persona attributes and values collected through the semi-structured interview.

pdf bib abs
Patent Claim Translation via Continual Pre-training of Large Language Models with Parallel Data
Haruto Azami | Minato Kondo | Takehito Utsuro | Masaaki Nagata
Proceedings of Machine Translation Summit XX: Volume 1

Recent advancements in large language models (LLMs) have enabled their application across various domains. However, in the field of patent translation, Transformer encoder-decoder based models remain the standard approach, and the potential of LLMs for translation tasks has not been thoroughly explored. In this study, we conducted patent claim translation using an LLM fine-tuned with parallel data through continual pre-training and supervised fine-tuning, following the methodology proposed by Guo et al. (2024) and Kondo et al. (2024). Comparative evaluation against the Transformer encoder-decoder based translations revealed that the LLM achieved high scores for both BLEU and COMET. This demonstrated improvements in addressing issues such as omissions and repetitions. Nonetheless, hallucination errors, which were not observed in the traditional models, occurred in some cases and negatively affected the translation quality. This study highlights the promise of LLMs for patent translation while identifying the challenges that warrant further investigation.

pdf bib abs
Improving Japanese-English Patent Claim Translation with Clause Segmentation Models based on Word Alignment
Masato Nishimura | Kosei Buma | Takehito Utsuro | Masaaki Nagata
Proceedings of Machine Translation Summit XX: Volume 1

In patent documents, patent claims represent a particularly important section as they define the scope of the claims. However, due to the length and unique formatting of these sentences, neural machine translation (NMT) systems are prone to translation errors, such as omissions and repetitions. To address these challenges, this study proposes a translation method that first segments the source sentences into multiple shorter clauses using a clause segmentation model tailored to facilitate translation. These segmented clauses are then translated using a clause translation model specialized for clause-level translation. Finally, the translated clauses are rearranged and edited into the final translation using a reordering and editing model. In addition, this study proposes a method for constructing clause-level parallel corpora required for training the clause segmentation and clause translation models. This method leverages word alignment tools to create clause-level data from sentence-level parallel corpora. Experimental results demonstrate that the proposed method achieves statistically significant improvements in BLEU scores compared to conventional NMT models. Furthermore, for sentences where conventional NMT models exhibit omissions and repetitions, the proposed method effectively suppresses these errors, enabling more accurate translations.

This paper presents the submission of NTTSU for the constrained track of the English–Japanese and Japanese–Chinese at the WMT2025 general translation task.For each translation direction, we build translation models from a large language model by combining continual pretraining, supervised fine-tuning, and preference optimization based on the translation quality and adequacy.We finally generate translations via context-aware MBR decoding to maximize translation quality and document-level consistency.

2024

pdf bib abs
Document Alignment based on Overlapping Fixed-Length Segments
Xiaotian Wang | Takehito Utsuro | Masaaki Nagata
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Acquiring large-scale parallel corpora is crucial for NLP tasks such as Neural Machine Translation, and web crawling has become a popular methodology for this purpose. Previous studies have been conducted based on sentence-based segmentation (SBS) when aligning documents in various languages which are obtained through web crawling. Among them, the TK-PERT method (Thompson and Koehn, 2020) achieved state-of-the-art results and addressed the boilerplate text in web crawling data well through a down-weighting approach. However, there remains a problem with how to handle long-text encoding better. Thus, we introduce the strategy of Overlapping Fixed-Length Segmentation (OFLS) in place of SBS, and observe a pronounced enhancement when performing the same approach for document alignment. In this paper, we compare the SBS and OFLS using three previous methods, Mean-Pool, TK-PERT (Thompson and Koehn, 2020), and Optimal Transport (Clark et al., 2019; El-Kishky and Guzman, 2020), on the WMT16 document alignment shared task for French-English, as well as on our self-established Japanese-English dataset MnRN. As a result, for the WMT16 task, various SBS based methods showed an increase in recall by 1% to 10% after reproduction with OFLS. For MnRN data, OFLS demonstrated notable accuracy improvements and exhibited faster document embedding speed.

pdf bib abs
Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data
Minato Kondo | Takehito Utsuro | Masaaki Nagata
Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)

In this paper, we propose a two-phase training approach where pre-trained large language models are continually pre-trained on parallel data and then supervised fine-tuned with a small amount of high-quality parallel data. To investigate the effectiveness of our proposed approach, we conducted continual pre-training with a 3.8B-parameter model and parallel data across eight different formats. We evaluate these methods on thirteen test sets for Japanese-to-English and English-to-Japanese translation. The results demonstrate that when utilizing parallel data in continual pre-training, it is essential to alternate between source and target sentences. Additionally, we demonstrated that the translation accuracy improves only for translation directions where the order of source and target sentences aligns between continual pre-training data and inference. In addition, we demonstrate that the LLM-based translation model is more robust in translating spoken language and achieves higher accuracy with less training data compared to supervised encoder-decoder models. We also show that the highest accuracy is achieved when the data for continual pre-training consists of interleaved source and target sentences and when tags are added to the source sentences.

pdf bib abs
Aggregating Impressions on Celebrities and their Reasons from Microblog Posts and Web Search Pages
Hibiki Yokoyama | Rikuto Tsuchida | Kosei Buma | Sho Miyakawa | Takehito Utsuro | Masaharu Yoshioka
Proceedings of the 3rd Workshop on Knowledge Augmented Methods for NLP

This paper aims to augment fans’ ability to critique and exploreinformation related to celebrities of interest. First, we collect postsfrom X (formerly Twitter) that discuss matters related to specificcelebrities. For the collection of major impressions from these posts,we employ ChatGPT as a large language model (LLM) to analyze andsummarize key sentiments. Next, based on collected impressions, wesearch for Web pages and collect the content of the top 30 ranked pagesas the source for exploring the reasons behind those impressions. Oncethe Web page content collection is complete, we collect and aggregatedetailed reasons for the impressions on the celebrities from the contentof each page. For this part, we continue to use ChatGPT, enhanced bythe retrieval augmented generation (RAG) framework, to ensure thereliability of the collected results compared to relying solely on theprior knowledge of the LLM. Evaluation results by comparing a referencethat is manually collected and aggregated reasons with those predictedby ChatGPT revealed that ChatGPT achieves high accuracy in reasoncollection and aggregation. Furthermore, we compared the performance ofChatGPT with an existing model of mT5 in reason collection and confirmedthat ChatGPT exhibits superior performance.

pdf bib abs
Coding Open-Ended Responses using Pseudo Response Generation by Large Language Models
Yuki Zenimoto | Ryo Hasegawa | Takehito Utsuro | Masaharu Yoshioka | Noriko Kando
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Survey research using open-ended responses is an important method thatcontributes to the discovery of unknown issues and new needs. However,survey research generally requires time and cost-consuming manual dataprocessing, indicating that it is difficult to analyze large dataset.To address this issue, we propose an LLM-based method to automate partsof the grounded theory approach (GTA), a representative approach of thequalitative data analysis. We generated and annotated pseudo open-endedresponses, and used them as the training data for the coding proceduresof GTA. Through evaluations, we showed that the models trained withpseudo open-ended responses are quite effective compared with thosetrained with manually annotated open-ended responses. We alsodemonstrate that the LLM-based approach is highly efficient andcost-saving compared to human-based approach.

pdf bib
Emoji Prediction of Japanese X Posts by LLMs
Yijie Hua | Takehito Utsuro
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation

pdf bib abs
Embedded Topic Models Enhanced by Wikification
Takashi Shibuya | Takehito Utsuro
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia

Topic modeling analyzes a collection of documents to learn meaningful patterns of words.However, previous topic models consider only the spelling of words and do not take into consideration the polysemy of words.In this study, we incorporate the Wikipedia knowledge into a neural topic model to make it aware of named entities.We evaluate our method on two datasets, 1) news articles of New York Times and 2) the AIDA-CoNLL dataset.Our experiments show that our method improves the performance of neural topic models in generalizability.Moreover, we analyze frequent words in each topic and the temporal dependencies between topics to demonstrate that our entity-aware topic models can capture the time-series development of topics well.

The NTTSU team’s submission leverages several large language models developed through a training procedure that includes continual pre-training and supervised fine-tuning. For paragraph-level translation, we generated synthetic paragraph-aligned data and utilized this data for training.In the task of translating Japanese to Chinese, we particularly focused on the speech domain translation. Specifically, we built Whisper models for Japanese automatic speech recognition (ASR). We used YODAS dataset for Whisper training. Since this data contained many noisy data pairs, we combined the Whisper outputs using ROVER for polishing the transcriptions. Furthermore, to enhance the robustness of the translation model against errors in the transcriptions, we performed data augmentation by forward translation from audio, using both ASR and base translation models.To select the best translation from multiple hypotheses of the models, we applied Minimum Bayes Risk decoding + reranking, incorporating scores such as COMET-QE, COMET, and cosine similarity by LaBSE.

2023

pdf bib abs
Headline Generation for Stock Price Fluctuation Articles
Shunsuke Nishida | Yuki Zenimoto | Xiaotian Wang | Takuya Tamura | Takehito Utsuro
Proceedings of the Sixth Workshop on Financial Technology and Natural Language Processing

The purpose of this paper is to construct a model for the generation of sophisticated headlines pertaining to stock price fluctuation articles, derived from the articles’ content. With respect to this headline generation objective, this paper solves three distinct tasks: in addition to the task of generating article headlines, two other tasks of extracting security names, and ascertaining the trajectory of stock prices, whether they are rising or declining. Regarding the headline generation task, we also revise the task as the model utilizes the outcomes of the security name extraction and rise/decline determination tasks, thereby for the purpose of preventing the inclusion of erroneous security names. We employed state-of-the-art pre-trained models from the field of natural language processing, fine-tuning these models for each task to enhance their precision. The dataset utilized for fine-tuning comprises a collection of articles delineating the rise and decline of stock prices. Consequently, we achieved remarkably high accuracy in the dual tasks of security name extraction and stock price rise or decline determination. For the headline generation task, a significant portion of the test data yielded fitting headlines.

pdf bib
Style-sensitive Sentence Embeddings for Evaluating Similarity in Speech Style of Japanese Sentences by Contrastive Learning
Yuki Zenimoto | Shinzan Komata | Takehito Utsuro
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf bib abs
Target Language Monolingual Translation Memory based NMT by Cross-lingual Retrieval of Similar Translations and Reranking
Takuya Tamura | Xiaotian Wang | Takehito Utsuro | Masaaki Nagata
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Retrieve-edit-rerank is a text generation framework composed of three steps: retrieving for sentences using the input sentence as a query, generating multiple output sentence candidates, and selecting the final output sentence from these candidates. This simple approach has outperformed other existing and more complex methods. This paper focuses on the retrieving and the reranking steps. In the retrieving step, we propose retrieving similar target language sentences from a target language monolingual translation memory using language-independent sentence embeddings generated by mSBERT or LaBSE. We demonstrate that this approach significantly outperforms existing methods that use monolingual inter-sentence similarity measures such as edit distance, which is only applicable to a parallel translation memory. In the reranking step, we propose a new reranking score for selecting the best sentences, which considers both the log-likelihood of each candidate and the sentence embeddings based similarity between the input and the candidate. We evaluated the proposed method for English-to-Japanese translation on the ASPEC and English-to-French translation on the EU Bookshop Corpus (EUBC). The proposed method significantly exceeded the baseline in BLEU score, especially observing a 1.4-point improvement in the EUBC dataset over the original Retrieve-Edit-Rerank method.

pdf bib abs
Leveraging Highly Accurate Word Alignment for Low Resource Translation by Pretrained Multilingual Model
Jingyi Zhu | Minato Kondo | Takuya Tamura | Takehito Utsuro | Masaaki Nagata
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Recently, there has been a growing interest in pretraining models in the field of natural language processing. As opposed to training models from scratch, pretrained models have been shown to produce superior results in low-resource translation tasks. In this paper, we introduced the use of pretrained seq2seq models for preordering and translation tasks. We utilized manual word alignment data and mBERT-based generated word alignment data for training preordering and compared the effectiveness of various types of mT5 and mBART models for preordering. For the translation task, we chose mBART as our baseline model and evaluated several input manners. Our approach was evaluated on the Asian Language Treebank dataset, consisting of 20,000 parallel data in Japanese, English and Hindi, where Japanese is either on the source or target side. We also used in-house 3,000 parallel data in Chinese and Japanese. The results indicated that mT5-large trained with manual word alignment achieved a preordering performance exceeding 0.9 RIBES score on Ja-En and Ja-Zh pairs. Moreover, our proposed approach significantly outperformed the baseline model in most translation directions of Ja-En, Ja-Zh, and Ja-Hi pairs in at least one of BLEU/COMET scores.

pdf bib
Large Scale Evaluation of End-to-End Pipeline of Speaker to Dialogue Attribution in Japanese Novels
Yuki Zenimoto | Shinzan Komata | Takehito Utsuro
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf bib
Enhanced Retrieve-Edit-Rerank Framework with kNN-MT
Xiaotian Wang | Takuya Tamura | Takehito Utsuro | Masaaki Nagata
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

2022

pdf bib abs
Detecting Causes of Stock Price Rise and Decline by Machine Reading Comprehension with BERT
Gakuto Tsutsumi | Takehito Utsuro
Proceedings of the 4th Financial Narrative Processing Workshop @LREC2022

In this paper, we focused on news reported when stock prices fluctuate significantly. The news reported when stock prices change is a very useful source of information on what factors cause stock prices to change. However, because it is manually produced, not all events that cause stock prices to change are necessarily reported. Thus, in order to provide investors with information on those causes of stock price changes, it is necessary to develop a system to collect information on events that could be closely related to the stock price changes of certain companies from the Internet. As the first step towards developing such a system, this paper takes an approach of employing a BERT-based machine reading comprehension model, which extracts causes of stock price rise and decline from news reports on stock price changes. In the evaluation, the approach of using the title of the article as the question of machine reading comprehension performs well. It is shown that the fine-tuned machine reading comprehension model successfully detects additional causes of stock price rise and decline other than those stated in the title of the article.

pdf bib
Speaker Identification of Quotes in Japanese Novels based on Gender Classification Model by BERT
Yuki Zenimoto | Takehito Utsuro
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

pdf bib
Developing and Evaluating a Dataset for How-to Tip Machine Reading at Scale
Fuzhu Zhu | Shuting Bai | Tingxuan Li | Takehito Utsuro
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

pdf bib
Tweet Review Mining focusing on Celebrities by Machine Reading Comprehension based on BERT
Yuta Nozaki | Kotoe Sugawara | Yuki Zenimoto | Takehito Utsuro
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

2021

pdf bib
Evaluating a How-to Tip Machine Comprehension Model with QA Examples collected from a Community QA Site
Tingxuan Li | Shuting Bai | Fuzhu Zhu | Takehito Utsuro
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

2020

pdf bib abs
MRC Examples Answerable by BERT without a Question Are Less Effective in MRC Model Training
Hongyu Li | Tengyang Chen | Shuting Bai | Takehito Utsuro | Yasuhide Kawada
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop

Models developed for Machine Reading Comprehension (MRC) are asked to predict an answer from a question and its related context. However, there exist cases that can be correctly answered by an MRC model using BERT, where only the context is provided without including the question. In this paper, these types of examples are referred to as “easy to answer”, while others are as “hard to answer”, i.e., unanswerable by an MRC model using BERT without being provided the question. Based on classifying examples as answerable or unanswerable by BERT without the given question, we propose a method based on BERT that splits the training examples from the MRC dataset SQuAD1.1 into those that are “easy to answer” or “hard to answer”. Experimental evaluation from a comparison of two models, one trained only with “easy to answer” examples and the other with “hard to answer” examples demonstrates that the latter outperforms the former.

pdf bib abs
Developing a How-to Tip Machine Comprehension Dataset and its Evaluation in Machine Comprehension by BERT
Tengyang Chen | Hongyu Li | Miho Kasamatsu | Takehito Utsuro | Yasuhide Kawada
Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER)

In the field of factoid question answering (QA), it is known that the state-of-the-art technology has achieved an accuracy comparable to that of humans in a certain benchmark challenge. On the other hand, in the area of non-factoid QA, there is still a limited number of datasets for training QA models, i.e., machine comprehension models. Considering such a situation within the field of the non-factoid QA, this paper aims to develop a dataset for training Japanese how-to tip QA models. This paper applies one of the state-of-the-art machine comprehension models to the Japanese how-to tip QA dataset. The trained how-to tip QA model is also compared with a factoid QA model trained with a Japanese factoid QA dataset. Evaluation results revealed that the how-to tip machine comprehension performance was almost comparative with that of the factoid machine comprehension even with the training data size reduced to around 4% of the factoid machine comprehension. Thus, the how-to tip machine comprehension task requires much less training data compared with the factoid machine comprehension task.

pdf bib abs
Automatic Annotation of Werewolf Game Corpus with Players Revealing Oneselves as Seer/Medium and Divination/Medium Results
Youchao Lin | Miho Kasamatsu | Tengyang Chen | Takuya Fujita | Huanjin Deng | Takehito Utsuro
Workshop on Games and Natural Language Processing

While playing the communication game “Are You a Werewolf”, a player always guesses other players’ roles through discussions, based on his own role and other players’ crucial utterances. The underlying goal of this paper is to construct an agent that can analyze the participating players’ utterances and play the werewolf game as if it is a human. For a step of this underlying goal, this paper studies how to accumulate werewolf game log data annotated with identification of players revealing oneselves as seer/medium, the acts of the divination and the medium and declaring the results of the divination and the medium. In this paper, we divide the whole task into four sub tasks and apply CNN/SVM classifiers to each sub task and evaluate their performance.

pdf bib abs
University of Tsukuba’s Machine Translation System for IWSLT20 Open Domain Translation Task
Hongyi Cui | Yizhen Wei | Shohei Iida | Takehito Utsuro | Masaaki Nagata
Proceedings of the 17th International Conference on Spoken Language Translation

In this paper, we introduce University of Tsukuba’s submission to the IWSLT20 Open Domain Translation Task. We participate in both Chinese→Japanese and Japanese→Chinese directions. For both directions, our machine translation systems are based on the Transformer architecture. Several techniques are integrated in order to boost the performance of our models: data filtering, large-scale noised training, model ensemble, reranking and postprocessing. Consequently, our efforts achieve 33.0 BLEU scores for Chinese→Japanese translation and 32.3 BLEU scores for Japanese→Chinese translation.

pdf bib abs
Integrating Disfluency-based and Prosodic Features with Acoustics in Automatic Fluency Evaluation of Spontaneous Speech
Huaijin Deng | Youchao Lin | Takehito Utsuro | Akio Kobayashi | Hiromitsu Nishizaki | Junichi Hoshino
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper describes an automatic fluency evaluation of spontaneous speech. In the task of automatic fluency evaluation, we integrate diverse features of acoustics, prosody, and disfluency-based ones. Then, we attempt to reveal the contribution of each of those diverse features to the task of automatic fluency evaluation. Although a variety of different disfluencies are observed regularly in spontaneous speech, we focus on two types of phenomena, i.e., filled pauses and word fragments. The experimental results demonstrate that the disfluency-based features derived from word fragments and filled pauses are effective relative to evaluating fluent/disfluent speech, especially when combined with prosodic features, e.g., such as speech rate and pauses/silence. Next, we employed an LSTM based framework in order to integrate the disfluency-based and prosodic features with time sequential acoustic features. The experimental evaluation results of those integrated diverse features indicate that time sequential acoustic features contribute to improving the model with disfluency-based and prosodic features when detecting fluent speech, but not when detecting disfluent speech. Furthermore, when detecting disfluent speech, the model without time sequential acoustic features performs best even without word fragments features, but only with filled pauses and prosodic features.

pdf bib
Text Mining of Evidence on Infants’ Developmental Stages for Developmental Order Acquisition from Picture Book Reviews
Miho Kasamatsu | Takehito Utsuro | Yu Saito | Yumiko Ishikawa
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

2019

pdf bib abs
Mixed Multi-Head Self-Attention for Neural Machine Translation
Hongyi Cui | Shohei Iida | Po-Hsuan Hung | Takehito Utsuro | Masaaki Nagata
Proceedings of the 3rd Workshop on Neural Generation and Translation

Recently, the Transformer becomes a state-of-the-art architecture in the filed of neural machine translation (NMT). A key point of its high-performance is the multi-head self-attention which is supposed to allow the model to independently attend to information from different representation subspaces. However, there is no explicit mechanism to ensure that different attention heads indeed capture different features, and in practice, redundancy has occurred in multiple heads. In this paper, we argue that using the same global attention in multiple heads limits multi-head self-attention’s capacity for learning distinct features. In order to improve the expressiveness of multi-head self-attention, we propose a novel Mixed Multi-Head Self-Attention (MMA) which models not only global and local attention but also forward and backward attention in different attention heads. This enables the model to learn distinct representations explicitly among multiple heads. In our experiments on both WAT17 English-Japanese as well as IWSLT14 German-English translation task, we show that, without increasing the number of parameters, our models yield consistent and significant improvements (0.9 BLEU scores on average) over the strong Transformer baseline.

pdf bib abs
Attention over Heads: A Multi-Hop Attention for Neural Machine Translation
Shohei Iida | Ryuichiro Kimura | Hongyi Cui | Po-Hsuan Hung | Takehito Utsuro | Masaaki Nagata
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

In this paper, we propose a multi-hop attention for the Transformer. It refines the attention for an output symbol by integrating that of each head, and consists of two hops. The first hop attention is the scaled dot-product attention which is the same attention mechanism used in the original Transformer. The second hop attention is a combination of multi-layer perceptron (MLP) attention and head gate, which efficiently increases the complexity of the model by adding dependencies between heads. We demonstrate that the translation accuracy of the proposed multi-hop attention outperforms the baseline Transformer significantly, +0.85 BLEU point for the IWSLT-2017 German-to-English task and +2.58 BLEU point for the WMT-2017 German-to-English task. We also find that the number of parameters required for a multi-hop attention is smaller than that for stacking another self-attention layer and the proposed model converges significantly faster than the original Transformer.

pdf bib
Proceedings of the 8th Workshop on Patent and Scientific Literature Translation
Takehito Utsuro | Katsuhito Sudoh | Takashi Tsunakawa
Proceedings of the 8th Workshop on Patent and Scientific Literature Translation

2018

pdf bib abs
Measuring Beginner Friendliness of Japanese Web Pages explaining Academic Concepts by Integrating Neural Image Feature and Text Features
Hayato Shiokawa | Kota Kawaguchi | Bingcai Han | Takehito Utsuro | Yasuhide Kawada | Masaharu Yoshioka | Noriko Kando
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

Search engine is an important tool of modern academic study, but the results are lack of measurement of beginner friendliness. In order to improve the efficiency of using search engine for academic study, it is necessary to invent a technique of measuring the beginner friendliness of a Web page explaining academic concepts and to build an automatic measurement system. This paper studies how to integrate heterogeneous features such as a neural image feature generated from the image of the Web page by a variant of CNN (convolutional neural network) as well as text features extracted from the body text of the HTML file of the Web page. Integration is performed through the framework of the SVM classifier learning. Evaluation results show that heterogeneous features perform better than each individual type of features.

2017

pdf bib
Neural Machine Translation Model with a Large Vocabulary Selected by Branching Entropy
Zi Long | Ryuichiro Kimura | Takehito Utsuro | Tomoharu Mitsuhashi | Mikio Yamamoto
Proceedings of Machine Translation Summit XVI: Research Track

pdf bib abs
Patent NMT integrated with Large Vocabulary Phrase Translation by SMT at WAT 2017
Zi Long | Ryuichiro Kimura | Takehito Utsuro | Tomoharu Mitsuhashi | Mikio Yamamoto
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

Neural machine translation (NMT) cannot handle a larger vocabulary because the training complexity and decoding complexity proportionally increase with the number of target words. This problem becomes even more serious when translating patent documents, which contain many technical terms that are observed infrequently. Long et al.(2017) proposed to select phrases that contain out-of-vocabulary words using the statistical approach of branching entropy. The selected phrases are then replaced with tokens during training and post-translated by the phrase translation table of SMT. In this paper, we apply the method proposed by Long et al. (2017) to the WAT 2017 Japanese-Chinese and Japanese-English patent datasets. Evaluation on Japanese-to-Chinese, Chinese-to-Japanese, Japanese-to-English and English-to-Japanese patent sentence translation proved the effectiveness of phrases selected with branching entropy, where the NMT model of Long et al.(2017) achieves a substantial improvement over a baseline NMT model without the technique proposed by Long et al.(2017).

2016

pdf bib abs
Analyzing Time Series Changes of Correlation between Market Share and Concerns on Companies measured through Search Engine Suggests
Takakazu Imada | Yusuke Inoue | Lei Chen | Syunya Doi | Tian Nie | Chen Zhao | Takehito Utsuro | Yasuhide Kawada
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper proposes how to utilize a search engine in order to predict market shares. We propose to compare rates of concerns of those who search for Web pages among several companies which supply products, given a specific products domain. We measure concerns of those who search for Web pages through search engine suggests. Then, we analyze whether rates of concerns of those who search for Web pages have certain correlation with actual market share. We show that those statistics have certain correlations. We finally propose how to predict the market share of a specific product genre based on the rates of concerns of those who search for Web pages.

pdf bib abs
Translation of Patent Sentences with a Large Vocabulary of Technical Terms Using Neural Machine Translation
Zi Long | Takehito Utsuro | Tomoharu Mitsuhashi | Mikio Yamamoto
Proceedings of the 3rd Workshop on Asian Translation (WAT2016)

Neural machine translation (NMT), a new approach to machine translation, has achieved promising results comparable to those of traditional approaches such as statistical machine translation (SMT). Despite its recent success, NMT cannot handle a larger vocabulary because training complexity and decoding complexity proportionally increase with the number of target words. This problem becomes even more serious when translating patent documents, which contain many technical terms that are observed infrequently. In NMTs, words that are out of vocabulary are represented by a single unknown token. In this paper, we propose a method that enables NMT to translate patent sentences comprising a large vocabulary of technical terms. We train an NMT system on bilingual data wherein technical terms are replaced with technical term tokens; this allows it to translate most of the source sentences except technical terms. Further, we use it as a decoder to translate source sentences with technical term tokens and replace the tokens with technical term translations using SMT. We also use it to rerank the 1,000-best SMT translations on the basis of the average of the SMT score and that of the NMT rescoring of the translated sentences with technical term tokens. Our experiments on Japanese-Chinese patent sentences show that the proposed NMT system achieves a substantial improvement of up to 3.1 BLEU points and 2.3 RIBES points over traditional SMT systems and an improvement of approximately 0.6 BLEU points and 0.8 RIBES points over an equivalent NMT system without our proposed technique.

The Japanese language has various types of functional expressions. In order to organize Japanese functional expressions with various surface forms, a lexicon of Japanese functional expressions with hierarchical organization was compiled. This paper proposes how to design the framework of identifying more than 16,000 functional expressions in Japanese texts by utilizing hierarchical organization of the lexicon. In our framework, more than 16,000 functional expressions are roughly divided into canonical / derived functional expressions. Each derived functional expression is intended to be identified by referring to the most similar occurrence of its canonical expression. In our framework, contextual occurrence information of much fewer canonical expressions are expanded into the whole forms of derived expressions, to be utilized when identifying those derived expressions. We also empirically show that the proposed method can correctly identify more than 80% of the functional / content usages only with less than 38,000 training instances of manually identified canonical expressions.

2011

pdf bib
Semi-Automatic Identification of Bilingual Synonymous Technical Terms from Phrase Tables and Parallel Patent Sentences
Bing Liang | Takehito Utsuro | Mikio Yamamoto
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

2010

pdf bib abs
Utilizing Semantic Equivalence Classes of Japanese Functional Expressions in Translation Rule Acquisition from Parallel Patent Sentences
Taiji Nagasaka | Ran Shimanouchi | Akiko Sakamoto | Takafumi Suzuki | Yohei Morishita | Takehito Utsuro | Suguru Matsuyoshi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In the ``Sandglass'' MT architecture, we identify the class of monosemous Japanese functional expressions and utilize it in the task of translating Japanese functional expressions into English. We employ the semantic equivalence classes of a recently compiled large scale hierarchical lexicon of Japanese functional expressions. We then study whether functional expressions within a class can be translated into a single canonical English expression. Based on the results of identifying monosemous semantic equivalence classes, this paper studies how to extract rules for translating functional expressions in Japanese patent documents into English. In this study, we use about 1.8M Japanese-English parallel sentences automatically extracted from Japanese-English patent families, which are distributed through the Patent Translation Task at the NTCIR-7 Workshop. Then, as a toolkit of a phrase-based SMT (Statistical Machine Translation) model, Moses is applied and Japanese-English translation pairs are obtained in the form of a phrase translation table. Finally, we extract translation pairs of Japanese functional expressions from the phrase translation table. Through this study, we found that most of the semantic equivalence classes judged as monosemous based on manual translation into English have only one translation rules even in the patent domain.

pdf bib
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)
Sadao Kurohashi | Takehito Utsuro
Proceedings of the Second Workshop on NLP Challenges in the Information Explosion Era (NLPIX 2010)

2009

pdf bib
Exploiting Patent Information for the Evaluation of Machine Translation
Atsushi Fujii | Masao Utiyama | Mikio Yamamoto | Takehito Utsuro
Proceedings of the Third Workshop on Patent Translation

pdf bib
Towards Conceptual Indexing of the Blogosphere through Wikipedia Topic Hierarchy
Mariko Kawaba | Daisuke Yokomoto | Hiroyuki Nakasaki | Takehito Utsuro | Tomohiro Fukuhara
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

pdf bib
Identifying and Utilizing the Class of Monosemous Japanese Functional Expressions in Machine Translation
Akiko Sakamoto | Taiji Nagasaka | Takehito Utsuro | Suguru Matsuyoshi
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 2

2008

pdf bib abs
Toward the Evaluation of Machine Translation Using Patent Information
Atsushi Fujii | Masao Utiyama | Mikio Yamamoto | Takehito Utsuro
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

To aid research and development in machine translation, we have produced a test collection for Japanese/English machine translation. To obtain a parallel corpus, we extracted patent documents for the same or related inventions published in Japan and the United States. Our test collection includes approximately 2000000 sentence pairs in Japanese and English, which were extracted automatically from our parallel corpus. These sentence pairs can be used to train and evaluate machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval, which can be used to evaluate the contribution of machine translation to retrieving patent documents across languages. This paper describes our test collection, methods for evaluating machine translation, and preliminary experiments.

pdf bib abs
Integrating a Phrase-based SMT Model and a Bilingual Lexicon for Semi-Automatic Acquisition of Technical Term Translation Lexicons
Yohei Morishita | Takehito Utsuro | Mikio Yamamoto
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

This paper presents an attempt at developing a technique of acquiring translation pairs of technical terms with sufficiently high precision from parallel patent documents. The approach taken in the proposed technique is based on integrating the phrase translation table of a state-of-the-art statistical phrase-based machine translation model, and compositional translation generation based on an existing bilingual lexicon for human use. Our evaluation results clearly show that the agreement between the two individual techniques definitely contribute to improving precision of translation candidates. We then apply the Support Vector Machines (SVMs) to the task of automatically validating translation candidates in the phrase translation table. Experimental evaluation results again show that the SVMs based approach to translation candidates validation can contribute to improving the precision of translation candidates in the phrase translation table.

pdf bib abs
Producing a Test Collection for Patent Machine Translation in the Seventh NTCIR Workshop
Atsushi Fujii | Masao Utiyama | Mikio Yamamoto | Takehito Utsuro
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In aiming at research and development on machine translation, we produced a test collection for Japanese-English machine translation in the seventh NTCIR Workshop. This paper describes details of our test collection. From patent documents published in Japan and the United States, we extracted patent families as a parallel corpus. A patent family is a set of patent documents for the same or related invention and these documents are usually filed to more than one country in different languages. In the parallel corpus, we aligned Japanese sentences with their counterpart English sentences. Our test collection, which includes approximately 2,000,000 sentence pairs, can be used to train and test machine translation systems. Our test collection also includes search topics for cross-lingual patent retrieval and the contribution of machine translation to a patent retrieval task can also be evaluated. Our test collection will be available to the public for research purposes after the NTCIR final meeting.

2007

pdf bib
Learning Dependency Relations of Japanese Compound Functional Expressions
Takehito Utsuro | Takao Shime | Masatoshi Tsuchiya | Suguru Matsuyoshi | Satoshi Sato
Proceedings of the Workshop on A Broader Perspective on Multiword Expressions

2006

pdf bib
Compiling French-Japanese Terminologies from the Web
Xavier Robitaille | Yasuhiro Sasaki | Masatsugu Tonoike | Satoshi Sato | Takehito Utsuro
11th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Japanese Idiom Recognition: Drawing a Line between Literal and Idiomatic Meanings
Chikara Hashimoto | Satoshi Sato | Takehito Utsuro
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf bib
Adjective-to-Verb Paraphrasing in Japanese Based on Lexical Constraints of Verbs
Atsushi Fujita | Naruaki Masuno | Satoshi Sato | Takehito Utsuro
Proceedings of the Fourth International Natural Language Generation Conference

pdf bib
A comparative study on compositional translation estimation using a domain/topic-specific corpus collected from the Web
Masatsugu Tonoike | Mitsuhiro Kida | Toshihiro Takagi | Yasuhiro Sasaki | Takehito Utsuro | S. Sato
Proceedings of the 2nd International Workshop on Web as Corpus

2005

pdf bib
Effect of Domain-Specific Corpus in Compositional Translation Estimation for Technical Terms
Masatsugu Tonoike | Mitsuhiro Kida | Toshihiro Takagi | Yasuhiro Sasaki | Takehito Utsuro | Satoshi Sato
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

2004

pdf bib
Integrating Cross-Lingually Relevant News Articles and Monolingual Web Documents in Bilingual Lexicon Acquisition
Takehito Utsuro | Kohei Hino | Mitsuhiro Kida | Seiichi Nakagawa | Satoshi Sato
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

pdf bib
An Empirical Study on Multiple LVCSR Model Combination by Machine Learning
Takehito Utsuro | Yasuhiro Kodama | Tomohiro Watanabe | Hiromitsu Nishizaki | Seiichi Nakagawa
Proceedings of HLT-NAACL 2004: Short Papers

pdf bib
Answer validation by keyword association
Masatsugu Tonoike | Takehito Utsuro | Satoshi Sato
Proceedings of the 3rd workshop on RObust Methods in Analysis of Natural Language Data (ROMAND 2004)

2003

pdf bib
Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora
Takehito Utsuro | Takashi Horiuchi | Kohei Hino | Takeshi Hamamoto | Takeaki Nakayama
10th Conference of the European Chapter of the Association for Computational Linguistics

2002

pdf bib abs
Semi-automatic compilation of bilingual lexcion entries from cross-lingually relevant news articles on WWW news sites
Takehito Utsuro | Takashi Horiuchi | Yasunobu Chiba | Takeshi Hamamoto
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

For the purpose of overcoming resource scarcity bottleneck in corpus-based translation knowledge acquisition research, this paper takes an approach of semi-automatically acquiring domain specific translation knowledge from the collection of bilingual news articles on WWW news sites. This paper presents results of applying standard co-occurrence frequency based techniques of estimating bilingual term correspondences from parallel corpora to relevant article pairs automatically collected from WWW news sites. The experimental evaluation results are very encouraging and it is proved that many useful bilingual term correspondences can be efficiently discovered with little human intervention from relevant article pairs on WWW news sites.

pdf bib
A Web-based English Abstract Writing Tool Using a Tagged E-J Parallel Corpus
Masumi Narita | Kazuya Kurokawa | Takehito Utsuro
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)

pdf bib
Combining Outputs of Multiple Japanese Named Entity Chunkers by Stacking
Takehito Utsuro | Manabu Sassano | Kiyotaka Uchimoto
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)