Tatsuya Kawahara

2024

pdf abs
Multilingual Turn-taking Prediction Using Voice Activity Projection
Koji Inoue | Bing’er Jiang | Erik Ekstedt | Tatsuya Kawahara | Gabriel Skantze
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze the sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking. Finally, we compare two different audio encoders, contrastive predictive coding (CPC) pre-trained on English, with a recent model based on multilingual wav2vec 2.0 (MMS).

pdf abs
Quantitative Analysis of Editing in Transcription Process in Japanese and European Parliaments and its Diachronic Changes
Tatsuya Kawahara
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

In making official transcripts for meeting records in Parliament, some edits are made from faithful transcripts of utterances for linguistic correction and formality. Classification of these edits is provided in this paper, and quantitative analysis is conducted for Japanese and European Parliamentary meetings by comparing the faithful transcripts of audio recordings against the official meeting records. Different trends are observed between the two Parliaments due to the nature of the language used and the meeting style. Moreover, its diachronic changes in the Japanese transcripts are presented, showing a significant decrease in the edits over the past decades. It was found that a majority of edits in the Japanese Parliament (Diet) simply remove fillers and redundant words, keeping the transcripts as verbatim as possible. This property is useful for the evaluation of the automatic speech transcription system, which was developed by us and has been used in the Japanese Parliament.

pdf abs
Video Retrieval System Using Automatic Speech Recognition for the Japanese Diet
Mikitaka Masuyama | Tatsuya Kawahara | Kenjiro Matsuda
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

The Japanese House of Representatives, one of the two houses of the Diet, has adopted an Automatic Speech Recognition (ASR) system, which directly transcribes parliamentary speech with an accuracy of 95 percent. The ASR system also provides a timestamp for every word, which enables retrieval of the video segments of the Parliamentary meetings. The video retrieval system we have developed allows one to pinpoint and play the parliamentary video clips corresponding to the meeting minutes by keyword search. In this paper, we provide its overview and suggest various ways we can utilize the system. The system is currently extended to cover meetings of local governments, which will allow us to investigate dialectal linguistic variations.

2023

pdf
RealPersonaChat: A Realistic Persona Chat Corpus with Interlocutors’ Own Personalities
Sanae Yamashita | Koji Inoue | Ao Guo | Shota Mochizuki | Tatsuya Kawahara | Ryuichiro Higashinaka
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf abs
Reasoning before Responding: Integrating Commonsense-based Causality Explanation for Empathetic Response Generation
Yahui Fu | Koji Inoue | Chenhui Chu | Tatsuya Kawahara
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user’s experiences and feelings. However, these approaches mainly focus on understanding the causalities of context from the user’s perspective, ignoring the system’s perspective. In this paper, we propose a commonsense-based causality explanation approach for diverse empathetic response generation that considers both the user’s perspective (user’s desires and reactions) and the system’s perspective (system’s intentions and reactions). We enhance ChatGPT’s ability to reason for the system’s perspective by integrating in-context learning with commonsense knowledge. Then, we integrate the commonsense-based causality explanation with both ChatGPT and a T5-based model. Experimental evaluations demonstrate that our method outperforms other comparable methods on both automatic and human evaluations.

2022

pdf abs
Simultaneous Job Interview System Using Multiple Semi-autonomous Agents
Haruki Kawai | Yusuke Muraki | Kenta Yamamoto | Divesh Lala | Koji Inoue | Tatsuya Kawahara
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

In recent years, spoken dialogue systems have been applied to job interviews where an applicant talks to a system that asks pre-defined questions, called on-demand and self-paced job interviews. We propose a simultaneous job interview system, where one interviewer can conduct one-on-one interviews with multiple applicants simultaneously by cooperating with the multiple autonomous job interview dialogue systems. However, it is challenging for interviewers to monitor and understand all the parallel interviews done by the autonomous system at the same time. As a solution to this issue, we implemented two automatic dialogue understanding functions: (1) response evaluation of each applicant’s responses and (2) keyword extraction as a summary of the responses. It is expected that interviewers, as needed, can intervene in one dialogue and smoothly ask a proper question that elaborates the interview. We report a pilot experiment where an interviewer conducted simultaneous job interviews with three candidates.

2021

pdf abs
Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation
Hirofumi Inaguma | Tatsuya Kawahara | Shinji Watanabe
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A conventional approach to improving the performance of end-to-end speech translation (E2E-ST) models is to leverage the source transcription via pre-training and joint training with automatic speech recognition (ASR) and neural machine translation (NMT) tasks. However, since the input modalities are different, it is difficult to leverage source language text successfully. In this work, we focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models. To leverage the full potential of the source language information, we propose backward SeqKD, SeqKD from a target-to-source backward NMT model. To this end, we train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder. The paraphrases are generated from the translations in bitext via back-translation. We further propose bidirectional SeqKD in which SeqKD from both forward and backward NMT models is combined. Experimental evaluations on both autoregressive and non-autoregressive models show that SeqKD in each direction consistently improves the translation performance, and the effectiveness is complementary regardless of the model capacity.

pdf abs
Multi-Referenced Training for Dialogue Response Generation
Tianyu Zhao | Tatsuya Kawahara
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

In open-domain dialogue response generation, a dialogue context can be continued with diverse responses, and the dialogue models should capture such one-to-many relations. In this work, we first analyze the training objective of dialogue models from the view of Kullback-Leibler divergence (KLD) and show that the gap between the real world probability distribution and the single-referenced data’s probability distribution prevents the model from learning the one-to-many relations efficiently. Then we explore approaches to multi-referenced training in two aspects. Data-wise, we generate diverse pseudo references from a powerful pretrained model to build multi-referenced data that provides a better approximation of the real-world distribution. Model-wise, we propose to equip variational models with an expressive prior, named linear Gaussian model (LGM). Experimental results of automated evaluation and human evaluation show that the methods yield significant improvements over baselines.

pdf abs
ERICA: An Empathetic Android Companion for Covid-19 Quarantine
Etsuko Ishii | Genta Indra Winata | Samuel Cahyawijaya | Divesh Lala | Tatsuya Kawahara | Pascale Fung
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a control simulation experiment to assess the effects of the user interface: a web-based virtual agent, Nora vs. the android ERICA via a video call. The experimental results show that the android can offer a more valuable user experience by giving the impression of being more empathetic and engaging in the conversation due to its nonverbal information, such as facial expressions and body gestures.

pdf abs
A multi-party attentive listening robot which stimulates involvement from side participants
Koji Inoue | Hiromi Sakamoto | Kenta Yamamoto | Divesh Lala | Tatsuya Kawahara
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

We demonstrate the moderating abilities of a multi-party attentive listening robot system when multiple people are speaking in turns. Our conventional one-on-one attentive listening system generates listener responses such as backchannels, repeats, elaborating questions, and assessments. In this paper, additional robot responses that stimulate a listening user (side participant) to become more involved in the dialogue are proposed. The additional responses elicit assessments and questions from the side participant, making the dialogue more empathetic and lively.

2020

pdf abs
Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language
Kohei Matsuura | Sei Ueno | Masato Mimura | Shinsuke Sakai | Tatsuya Kawahara
Proceedings of the Twelfth Language Resources and Evaluation Conference

Ainu is an unwritten language that has been spoken by Ainu people who are one of the ethnic groups in Japan. It is recognized as critically endangered by UNESCO and archiving and documentation of its language heritage is of paramount importance. Although a considerable amount of voice recordings of Ainu folklore has been produced and accumulated to save their culture, only a quite limited parts of them are transcribed so far. Thus, we started a project of automatic speech recognition (ASR) for the Ainu language in order to contribute to the development of annotated language archives. In this paper, we report speech corpus development and the structure and performance of end-to-end ASR for Ainu. We investigated four modeling units (phone, syllable, word piece, and word) and found that the syllable-based model performed best in terms of both word and phone recognition accuracy, which were about 60% and over 85% respectively in speaker-open condition. Furthermore, word and phone accuracy of 80% and 90% has been achieved in a speaker-closed setting. We also found out that a multilingual ASR training with additional speech corpora of English and Japanese further improves the speaker-open test accuracy.

pdf abs
An Attentive Listening System with Android ERICA: Comparison of Autonomous and WOZ Interactions
Koji Inoue | Divesh Lala | Kenta Yamamoto | Shizuka Nakamura | Katsuya Takanashi | Tatsuya Kawahara
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

We describe an attentive listening system for the autonomous android robot ERICA. The proposed system generates several types of listener responses: backchannels, repeats, elaborating questions, assessments, generic sentimental responses, and generic responses. In this paper, we report a subjective experiment with 20 elderly people. First, we evaluated each system utterance excluding backchannels and generic responses, in an offline manner. It was found that most of the system utterances were linguistically appropriate, and they elicited positive reactions from the subjects. Furthermore, 58.2% of the responses were acknowledged as being appropriate listener responses. We also compared the proposed system with a WOZ system where a human operator was operating the robot. From the subjective evaluation, the proposed system achieved comparable scores in basic skills of attentive listening such as encouragement to talk, focused on the talk, and actively listening. It was also found that there is still a gap between the system and the WOZ for more sophisticated skills such as dialogue understanding, showing interest, and empathy towards the user.

pdf abs
Designing Precise and Robust Dialogue Response Evaluators
Tianyu Zhao | Divesh Lala | Tatsuya Kawahara
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Automatic dialogue response evaluator has been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement and generalizes robustly to diverse responses and corpora. We open-source the code and data in https://github.com/ZHAOTING/dialog-processing.

pdf abs
Topic-relevant Response Generation using Optimal Transport for an Open-domain Dialog System
Shuying Zhang | Tianyu Zhao | Tatsuya Kawahara
Proceedings of the 28th International Conference on Computational Linguistics

Conventional neural generative models tend to generate safe and generic responses which have little connection with previous utterances semantically and would disengage users in a dialog system. To generate relevant responses, we propose a method that employs two types of constraints - topical constraint and semantic constraint. Under the hypothesis that a response and its context have higher relevance when they share the same topics, the topical constraint encourages the topics of a response to match its context by conditioning response decoding on topic words’ embeddings. The semantic constraint, which encourages a response to be semantically related to its context by regularizing the decoding objective function with semantic distance, is proposed. Optimal transport is applied to compute a weighted semantic distance between the representation of a response and the context. Generated responses are evaluated by automatic metrics, as well as human judgment, showing that the proposed method can generate more topic-relevant and content-rich responses than conventional models.

2018

pdf abs
A Unified Neural Architecture for Joint Dialog Act Segmentation and Recognition in Spoken Dialog System
Tianyu Zhao | Tatsuya Kawahara
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

In spoken dialog systems (SDSs), dialog act (DA) segmentation and recognition provide essential information for response generation. A majority of previous works assumed ground-truth segmentation of DA units, which is not available from automatic speech recognition (ASR) in SDS. We propose a unified architecture based on neural networks, which consists of a sequence tagger for segmentation and a classifier for recognition. The DA recognition model is based on hierarchical neural networks to incorporate the context of preceding sentences. We investigate sharing some layers of the two components so that they can be trained jointly and learn generalized features from both tasks. An evaluation on the Switchboard Dialog Act (SwDA) corpus shows that the jointly-trained models outperform independently-trained models, single-step models, and other reported results in DA segmentation, recognition, and joint tasks.

2017

pdf abs
Joint Learning of Dialog Act Segmentation and Recognition in Spoken Dialog Using Neural Networks
Tianyu Zhao | Tatsuya Kawahara
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Dialog act segmentation and recognition are basic natural language understanding tasks in spoken dialog systems. This paper investigates a unified architecture for these two tasks, which aims to improve the model’s performance on both of the tasks. Compared with past joint models, the proposed architecture can (1) incorporate contextual information in dialog act recognition, and (2) integrate models for tasks of different levels as a whole, i.e. dialog act segmentation on the word level and dialog act recognition on the segment level. Experimental results show that the joint training system outperforms the simple cascading system and the joint coding system on both dialog act segmentation and recognition tasks.

pdf abs
Attentive listening system with backchanneling, response generation and flexible turn-taking
Divesh Lala | Pierrick Milhorat | Koji Inoue | Masanari Ishida | Katsuya Takanashi | Tatsuya Kawahara
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

Attentive listening systems are designed to let people, especially senior people, keep talking to maintain communication ability and mental health. This paper addresses key components of an attentive listening system which encourages users to talk smoothly. First, we introduce continuous prediction of end-of-utterances and generation of backchannels, rather than generating backchannels after end-point detection of utterances. This improves subjective evaluations of backchannels. Second, we propose an effective statement response mechanism which detects focus words and responds in the form of a question or partial repeat. This can be applied to any statement. Moreover, a flexible turn-taking mechanism is designed which uses backchannels or fillers when the turn-switch is ambiguous. These techniques are integrated into a humanoid robot to conduct attentive listening. We test the feasibility of the system in a pilot experiment and show that it can produce coherent dialogues during conversation.

2016

pdf
Talking with ERICA, an autonomous android
Koji Inoue | Pierrick Milhorat | Divesh Lala | Tianyu Zhao | Tatsuya Kawahara
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf abs
Automatic Speech Recognition Errors as a Predictor of L2 Listening Difficulties
Maryam Sadat Mirzaei | Kourosh Meshgi | Tatsuya Kawahara
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

This paper investigates the use of automatic speech recognition (ASR) errors as indicators of the second language (L2) learners’ listening difficulties and in doing so strives to overcome the shortcomings of Partial and Synchronized Caption (PSC) system. PSC is a system that generates a partial caption including difficult words detected based on high speech rate, low frequency, and specificity. To improve the choice of words in this system, and explore a better method to detect speech challenges, ASR errors were investigated as a model of the L2 listener, hypothesizing that some of these errors are similar to those of language learners’ when transcribing the videos. To investigate this hypothesis, ASR errors in transcription of several TED talks were analyzed and compared with PSC’s selected words. Both the overlapping and mismatching cases were analyzed to investigate possible improvement for the PSC system. Those ASR errors that were not detected by PSC as cases of learners’ difficulties were further analyzed and classified into four categories: homophones, minimal pairs, breached boundaries and negatives. These errors were embedded into the baseline PSC to make the enhanced version and were evaluated in an experiment with L2 learners. The results indicated that the enhanced version, which encompasses the ASR errors addresses most of the L2 learners’ difficulties and better assists them in comprehending challenging video segments as compared with the baseline.

2014

pdf abs
Japanese-to-English patent translation system based on domain-adapted word segmentation and post-ordering
Katsuhito Sudoh | Masaaki Nagata | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

This paper presents a Japanese-to-English statistical machine translation system specialized for patent translation. Patents are practically useful technical documents, but their translation needs different efforts from general-purpose translation. There are two important problems in the Japanese-to-English patent translation: long distance reordering and lexical translation of many domain-specific terms. We integrated novel lexical translation of domain-specific terms with a syntax-based post-ordering framework that divides the machine translation problem into lexical translation and reordering explicitly for efficient syntax-based translation. The proposed lexical translation consists of a domain-adapted word segmentation and an unknown word transliteration. Experimental results show our system achieves better translation accuracy in BLEU and TER compared to the baseline methods.

pdf
Information Navigation System Based on POMDP that Tracks User Focus
Koichiro Yoshino | Tatsuya Kawahara
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

2013

pdf
Predicate Argument Structure Analysis using Partially Annotated Corpora
Koichiro Yoshino | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

pdf
Language Modeling for Spoken Dialogue System based on Filtering using Predicate-Argument Structures
Koichiro Yoshino | Shinsuke Mori | Tatsuya Kawahara
Proceedings of COLING 2012

pdf abs
Designing an Evaluation Framework for Spoken Term Detection and Spoken Document Retrieval at the NTCIR-9 SpokenDoc Task
Tomoyosi Akiba | Hiromitsu Nishizaki | Kiyoaki Aikawa | Tatsuya Kawahara | Tomoko Matsui
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the evaluation framework for spoken document retrieval for the IR for the Spoken Documents Task, conducted in the ninth NTCIR Workshop. The two parts of this task were a spoken term detection (STD) subtask and an ad hoc spoken document retrieval subtask (SDR). Both subtasks target search terms, passages and documents included in academic and simulated lectures of the Corpus of Spontaneous Japanese. Seven teams participated in the STD subtask and five in the SDR subtask. The results obtained through the evaluation in the workshop are discussed.

pdf
Machine Translation without Words through Substring Alignment
Graham Neubig | Taro Watanabe | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Multi-modal Sensing and Analysis of Poster Conversations: Toward Smart Posterboard
Tatsuya Kawahara
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2011

pdf
An Unsupervised Model for Joint Phrase Alignment and Extraction
Graham Neubig | Taro Watanabe | Eiichiro Sumita | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Spoken Dialogue System based on Information Extraction using Similarity of Predicate Argument Structures
Koichiro Yoshino | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the SIGDIAL 2011 Conference

2008

The Spoken Document Processing Working Group, which is part of the special interest group of spoken language processing of the Information Processing Society of Japan, is developing a test collection for evaluation of spoken document retrieval systems. A prototype of the test collection consists of a set of textual queries, relevant segment lists, and transcriptions by an automatic speech recognition system, allowing retrieval from the Corpus of Spontaneous Japanese (CSJ). From about 100 initial queries, application of the criteria that a query should have more than five relevant segments that consist of about one minute speech segments yielded 39 queries. Targeting the test collection, an ad hoc retrieval experiment was also conducted to assess the baseline retrieval performance by applying a standard method for spoken document retrieval.

pdf
Bayes Risk-based Dialogue Management for Document Retrieval System with Speech Interface
Teruhisa Misu | Tatsuya Kawahara
Coling 2008: Companion volume: Posters

2006

pdf abs
Dependency-structure Annotation to Corpus of Spontaneous Japanese
Kiyotaka Uchimoto | Ryoji Hamabe | Takehiko Maruyama | Katsuya Takanashi | Tatsuya Kawahara | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In Japanese, syntactic structure of a sentence is generally represented by the relationship between phrasal units, or bunsetsus inJapanese, based on a dependency grammar. In the same way, thesyntactic structure of a sentence in a large, spontaneous, Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ), isrepresented by dependency relationships between bunsetsus. This paper describes the criteria and definitions of dependency relationships between bunsetsus in the CSJ. The dependency structure of the CSJ is investigated, and the difference in the dependency structures ofwritten text and spontaneous speech is discussed in terms of thedependency accuracies obtained by using a corpus-based model. It is shown that the accuracy of automatic dependency-structure analysis canbe improved if characteristic phenomena of spontaneous speech such as self-corrections, basic utterance units in spontaneous speech, and bunsetsus that have no modifiee are detected and used for dependency-structure analysis.

pdf
Detection of Quotations and Inserted Clauses and Its Application to Dependency Structure Analysis in Spontaneous Japanese
Ryoji Hamabe | Kiyotaka Uchimoto | Tatsuya Kawahara | Hitoshi Isahara
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions