2024
Analysis on Unsupervised Acquisition Process of Bilingual Vocabulary through Iterative Back-Translation
Takuma Tanigawa | Tomoyosi Akiba | Hajime Tsukada
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In this paper, we investigate how new bilingual vocabulary is acquired through Iterative Back-Translation (IBT), a data augmentation method for machine translation that uses monolingual data in both the source and target languages. To reveal the acquisition process, we first identify the word translation pairs in the test data that do not appear in the bilingual data but appear only in the two monolingual data sets, and then observe how many of these pairs are successfully translated by the translation model trained through IBT. We conducted experiments in domain adaptation settings on two language pairs. Our experimental evaluation showed that more than 60% of the new bilingual vocabulary is successfully acquired through IBT, along with an improvement in translation quality in terms of BLEU. It also revealed that new bilingual vocabulary was acquired gradually as IBT iterations were repeated. From these results, we present our hypothesis on the process of new bilingual vocabulary acquisition, in which the context of the words plays a critical role in the success of the acquisition.
Masking Explicit Pro-Con Expressions for Development of a Stance Classification Dataset on Assembly Minutes
Tomoyosi Akiba | Yuki Gato | Yasutomo Kimura | Yuzu Uchida | Keiichi Takamaru
Proceedings of the Second Workshop on Natural Language Processing for Political Sciences @ LREC-COLING 2024
In this paper, a new dataset for Stance Classification based on assembly minutes is introduced. We develop it by using publicly available minutes taken from diverse Japanese local governments, including prefectural, city, and town assemblies. To make the task one of predicting a stance from the content of a politician’s utterance without explicit stance expressions, predefined words that directly convey the speaker’s stance in the utterance are replaced by a special token. Those masked words are also used to assign a gold label, either agreement or disagreement, to the utterance. In total, we automatically constructed 15,018 instances from 47 Japanese local governments. The dataset is used in the shared Stance Classification task evaluated in the NTCIR-17 QA-Lab-PoliInfo-4, and is now publicly available. Since the construction method of the dataset is automatic, it can still be applied to obtain more instances from other Japanese local governments.
2012
Statistical Machine Translation without Source-side Parallel Corpus Using Word Lattice and Phrase Extension
Takanori Kusumoto | Tomoyosi Akiba
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Statistical machine translation (SMT) requires a parallel corpus between the source and target languages. Although a pivot-translation approach can be applied to a language pair that does not have a parallel corpus directly between them, it requires both source-pivot and pivot-target parallel corpora. We propose a novel approach to apply SMT to a resource-limited source language that has no parallel corpus but has only a word dictionary for the pivot language. The problems with dictionary-based translations lie in their ambiguity and incompleteness. The proposed method uses a word lattice representation of the pivot-language candidates and word lattice decoding to deal with the ambiguity; the lattice expansion is accomplished by using a pivot-target phrase translation table to compensate for the incompleteness. Our experimental evaluation showed that this approach is promising for applying SMT, even when a source-side parallel corpus is lacking.
Designing an Evaluation Framework for Spoken Term Detection and Spoken Document Retrieval at the NTCIR-9 SpokenDoc Task
Tomoyosi Akiba | Hiromitsu Nishizaki | Kiyoaki Aikawa | Tatsuya Kawahara | Tomoko Matsui
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We describe the evaluation framework for spoken document retrieval for the IR for the Spoken Documents Task, conducted in the ninth NTCIR Workshop. The two parts of this task were a spoken term detection (STD) subtask and an ad hoc spoken document retrieval subtask (SDR). Both subtasks target search terms, passages and documents included in academic and simulated lectures of the Corpus of Spontaneous Japanese. Seven teams participated in the STD subtask and five in the SDR subtask. The results obtained through the evaluation in the workshop are discussed.
2010
Language Modeling Approach for Retrieving Passages in Lecture Audio Data
Koichiro Honda | Tomoyosi Akiba
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Spoken Document Retrieval (SDR) is a promising technology for enhancing the utility of spoken materials. After the spoken documents have been transcribed by using a Large Vocabulary Continuous Speech Recognition (LVCSR) decoder, a text-based ad hoc retrieval method can be applied directly to the transcribed documents. However, recognition errors will significantly degrade the retrieval performance. To address this problem, we have previously proposed a method that aimed to fill the gap between automatically transcribed text and correctly transcribed text by using a statistical translation technique. In this paper, we extend the method by (1) using neighboring context to index the target passage, and (2) applying a language modeling approach for document retrieval. Our experimental evaluation shows that context information can improve retrieval performance, and that the language modeling approach is effective in incorporating context information into the proposed SDR method, which uses a translation model.
2008
Test Collections for Spoken Document Retrieval from Lecture Audio Data
Tomoyosi Akiba | Kiyoaki Aikawa | Yoshiaki Itoh | Tatsuya Kawahara | Hiroaki Nanjo | Hiromitsu Nishizaki | Norihito Yasuda | Yoichi Yamashita | Katunobu Itou
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The Spoken Document Processing Working Group, which is part of the special interest group on spoken language processing of the Information Processing Society of Japan, is developing a test collection for the evaluation of spoken document retrieval systems. A prototype of the test collection consists of a set of textual queries, relevant segment lists, and transcriptions by an automatic speech recognition system, allowing retrieval from the Corpus of Spontaneous Japanese (CSJ). From about 100 initial queries, applying the criterion that a query should have more than five relevant segments, each consisting of about one minute of speech, yielded 39 queries. Targeting the test collection, an ad hoc retrieval experiment was also conducted to assess the baseline retrieval performance by applying a standard method for spoken document retrieval.
Statistical Machine Translation based Passage Retrieval for Cross-Lingual Question Answering
Tomoyosi Akiba | Kei Shimizu | Atsushi Fujii
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II
2006
Exploiting Dynamic Passage Retrieval for Spoken Question Recognition and Context Processing towards Speech-driven Information Access Dialogue
Tomoyosi Akiba
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Speech interfaces and dialogue processing abilities have promise for improving the utility of open-domain question answering (QA). We propose a novel method of resolving the disambiguation problems that arise in these speech- and dialogue-enhanced QA tasks. The proposed method exploits passage retrieval, which is one of the main components common to many QA systems. The basic idea of the method is that the similarity with some passage in the target documents can be used to select the appropriate question from the candidates. In this paper, we apply the method to two subtasks of QA: (1) N-best rescoring of LVCSR outputs, which selects the most appropriate candidate as a question sentence, in the speech-driven QA (SDQA) task, and (2) context processing, which composes a complete question sentence from a submitted incomplete one by using the elements that appeared in the dialogue context, in the information access dialogue (IAD) task. For both tasks, dynamic passage retrieval is introduced to further improve the performance. The experimental results showed that the proposed method is quite effective at improving QA performance in both tasks.
2004
Collecting Spontaneously Spoken Queries for Information Retrieval
Tomoyosi Akiba | Atsushi Fujii | Katunobu Itou
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
1994
A Bayesian Approach for User Modeling in Dialogue Systems
Tomoyosi Akiba | Hozumi Tanaka
COLING 1994 Volume 2: The 15th International Conference on Computational Linguistics