Kiyotaka Uchimoto

2018

pdf
Extending Search System based on Interactive Visualization for Speech Corpora
Tomoko Ohsuga | Yuichi Ishimoto | Tomoko Kajiyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shuichi Itahashi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

In this paper, we describe the details of the ASPEC (Asian Scientific Paper Excerpt Corpus), which is the first large-size parallel corpus of scientific paper domain. ASPEC was constructed in the Japanese-Chinese machine translation project conducted between 2006 and 2010 using the Special Coordination Funds for Promoting Science and Technology. It consists of a Japanese-English scientific paper abstract corpus of approximately 3 million parallel sentences (ASPEC-JE) and a Chinese-Japanese scientific paper excerpt corpus of approximately 0.68 million parallel sentences (ASPEC-JC). ASPEC is used as the official dataset for the machine translation evaluation workshop WAT (Workshop on Asian Translation).

2010

pdf abs
Adapting Chinese Word Segmentation for Machine Translation Based on Short Units
Yiou Wang | Kiyotaka Uchimoto | Jun’ichi Kazama | Canasai Kruengkrai | Kentaro Torisawa
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In Chinese texts, words composed of single or multiple characters are not separated by spaces, unlike most western languages. Therefore Chinese word segmentation is considered an important first step in machine translation (MT) and its performance impacts MT results. Many factors affect Chinese word segmentations, including the segmentation standards and segmentation strategies. The performance of a corpus-based word segmentation model depends heavily on the quality and the segmentation standard of the training corpora. However, we observed that existing manually annotated Chinese corpora tend to have low segmentation granularity and provide poor morphological information due to the present segmentation standards. In this paper, we introduce a short-unit standard of Chinese word segmentation, which is particularly suitable for machine translation, and propose a semi-automatic method of transforming the existing corpora into the ones that can satisfy our standards. We evaluate the usefulness of our approach on the basis of translation tasks from the technology newswire domain and the scientific paper domain, and demonstrate that it significantly improves the performance of Chinese-Japanese machine translation (over 1.0 BLEU increase).

pdf abs
Collection of Usage Information for Language Resources from Academic Articles
Shunsuke Kozawa | Hitomi Tohyama | Kiyotaka Uchimoto | Shigeki Matsubara
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Recently, language resources (LRs) are becoming indispensable for linguistic researches. However, existing LRs are often not fully utilized because their variety of usage is not well known, indicating that their intrinsic value is not recognized very well either. Regarding this issue, lists of usage information might improve LR searches and lead to their efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes to construct a text corpus annotated with usage information (UI corpus). In particular, we automatically extract sentences containing LR names from academic articles. Then, the extracted sentences are annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus.

2009

pdf
Improving Dependency Parsing with Subtrees from Auto-Parsed Data
Wenliang Chen | Jun’ichi Kazama | Kiyotaka Uchimoto | Kentaro Torisawa
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf
Can Chinese Phonemes Improve Machine Transliteration?: A Comparative Study of English-to-Chinese Transliteration Models
Jong-Hoon Oh | Kiyotaka Uchimoto | Kentaro Torisawa
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf
Multilingual Dependency Learning: Exploiting Rich Features for Tagging Syntactic and Semantic Dependencies
Hai Zhao | Wenliang Chen | Jun’ichi Kazama | Kiyotaka Uchimoto | Kentaro Torisawa
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task

pdf
Machine Transliteration using Target-Language Grapheme and Phoneme: Multi-engine Transliteration Approach
Jong-Hoon Oh | Kiyotaka Uchimoto | Kentaro Torisawa
Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration (NEWS 2009)

pdf
Bilingual Co-Training for Monolingual Hyponymy-Relation Acquisition
Jong-Hoon Oh | Kiyotaka Uchimoto | Kentaro Torisawa
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf
An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging
Canasai Kruengkrai | Kiyotaka Uchimoto | Jun’ichi Kazama | Yiou Wang | Kentaro Torisawa | Hitoshi Isahara
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf
Dependency Parsing with Short Dependency Relations in Unlabeled Data
Wenliang Chen | Daisuke Kawahara | Kiyotaka Uchimoto | Yujie Zhang | Hitoshi Isahara
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf
Learning Reliability of Parses for Domain Adaptation of Dependency Parsing
Daisuke Kawahara | Kiyotaka Uchimoto
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf abs
Boot-Strapping a WordNet Using Multiple Existing WordNets
Francis Bond | Hitoshi Isahara | Kyoko Kanzaki | Kiyotaka Uchimoto
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe the construction of an illustrated Japanese Wordnet. We bootstrap the Wordnet using existing multiple existing wordnets in order to deal with the ambiguity inherent in translation. We illustrate it with pictures from the Open Clip Art Library.

pdf abs
A Method for Automatically Constructing Case Frames for English
Daisuke Kawahara | Kiyotaka Uchimoto
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Case frames are an important knowledge base for a variety of natural language processing (NLP) systems. For the practical use of these systems in the real world, wide-coverage case frames are required. In order to acquire such large-scale case frames, in this paper, we automatically compile case frames from a large corpus. The resultant case frames that are compiled from the English Gigaword corpus contain 9,300 verb entries. The case frames include most examples of normal usage, and are ready to be used in numerous NLP analyzers and applications.

pdf abs
Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and its Application
Kiyotaka Uchimoto | Yasuharu Den
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In Japanese, the syntactic structure of a sentence is generally represented by the relationship between phrasal units, bunsetsus in Japanese, based on a dependency grammar. In many cases, the syntactic structure of a bunsetsu is not considered in syntactic structure annotation. This paper gives the criteria and definitions of dependency relationships between words in a bunsetsu and their applications. The target corpus for the word-level dependency annotation is a large spontaneous Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ). One application of word-level dependency relationships is to find basic units for constructing accent phrases.

pdf abs
Development of the Japanese WordNet
Hitoshi Isahara | Francis Bond | Kiyotaka Uchimoto | Masao Utiyama | Kyoko Kanzaki
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

After a long history of compilation of our own lexical resources, EDR Japanese/English Electronic Dictionary, and discussions with major players on development of various WordNets, Japanese National Institute of Information and Communications Technology started developing the Japanese WordNet in 2006 and will publicly release the first version, which includes both the synset in Japanese and the annotated Japanese corpus of SemCor, in June 2008. As the first step in compiling the Japanese WordNet, we added Japanese equivalents to synsets of the Princeton WordNet. Of course, we must also add some synsets which do not exist in the Princeton WordNet, and must modify synsets in the Princeton WordNet, in order to make the hierarchical structure of Princeton synsets represent thesaurus-like information found in the Japanese language, however, we will address these tasks in a future study. We then translated English sentences which are used in the SemCor annotation into Japanese and annotated them using our Japanese WordNet. This article describes the overview of our project to compile Japanese WordNet and other resources which relate to our Japanese WordNet.

pdf abs
Automatic Acquisition of Usage Information for Language Resources
Shunsuke Kozawa | Hitomi Tohyama | Kiyotaka Uchimoto | Shigeki Matsubara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recently, language resources (LRs) are becoming indispensable for linguistic research. Unfortunately, it is not easy to find their usages by searching the web even though they must be described in the Internet or academic articles. This indicates that the intrinsic value of LRs is not recognized very well. In this research, therefore, we extract a list of usage information for each LR to promote the efficient utilization of LRs. In this paper, we proposed a method for extracting a list of usage information from academic articles by using rules based on syntactic information. The rules are generated by focusing on the syntactic features that are observed in the sentences describing usage information. As a result of experiments, we achieved 72.9% in recall and 78.4% in precision for the closed test and 60.9% in recall and 72.7% in precision for the open test.

pdf abs
Construction of a Metadata Database for Efficient Development and Use of Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The National Institute of Information and Communications Technology (NICT) and Nagoya University have been jointly constructing a large scale database named SHACHI by collecting detailed meta-information on language resources (LRs) in Asia and Western countries, for the purpose of effectively combining LRs. The purpose of this project is to investigate languages, tag sets, and formats compiled in LRs throughout the world, to systematically store LR metadata, to create a search function for this information, and to ultimately utilize all this for a more efficient development of LRs. This metadata database contains more than 2,000 compiled LRs such as corpora, dictionaries, thesauruses and lexicons, forming a large scale metadata of LRs archive. Its metadata, an extended version of OLAC metadata set conforming to Dublin Core, which contain detailed meta-information, have been collected semi-automatically. This paper explains the design and the structure of the metadata database, as well as the realization of the catalogue search tool. Additionally, the website of this database is now open to the public and accessible to all Internet users.

pdf abs
Word Alignment Annotation in a Japanese-Chinese Parallel Corpus
Yujie Zhang | Zhulong Wang | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Parallel corpora are critical resources for machine translation research and development since parallel corpora contain translation equivalences of various granularities. Manual annotation of word & phrase alignment is of significance to provide gold-standard for developing and evaluating both example-based machine translation model and statistical machine translation model. This paper presents the work of word & phrase alignment annotation in the NICT Japanese-Chinese parallel corpus, which is constructed at the National Institute of Information and Communications Technology (NICT). We describe the specification of word alignment annotation and the tools specially developed for the manual annotation. The manual annotation on 17,000 sentence pairs has been completed. We examined the manually annotated word alignment data and extracted translation knowledge from the word & phrase aligned corpus.

pdf
Construction of an Infrastructure for Providing Users with Suitable Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Coling 2008: Companion volume: Posters

Careful tuning of user-created dictionaries is indispensable when using a machine translation system for computer aided translation. However, there is no widely used standard for user dictionaries in the Japanese/English machine translation market. To address this issue, AAMT (the Asia-Pacific Association for Machine Translation) has established a specification of sharable dictionaries (UTX-S: Universal Terminology eXchange -- Simple), which can be used across different machine translation systems, thus increasing the interoperability of language resources. UTX-S is simpler than existing specifications such as UPF and OLIF. It was explicitly designed to make it easy to (a) add new user dictionaries and (b) share existing user dictionaries. This facilitates rapid user dictionary production and avoids vendor tie in. In this study we describe the UTX-Simple (UTX-S) format, and show that it can be converted to the user dictionary formats for five commercial English-Japanese MT systems. We then present a case study where we (a) convert an on-line glossary to UTX-S, and (b) produce user dictionaries for five different systems, and then exchange them. The results show that the simplified format of UTX-S can be used to rapidly build dictionaries. Further, we confirm that customized user dictionaries are effective across systems, although with a slight loss in quality: on average, user dictionaries improved the translations for 44.8% of translations with the systems they were built for and 37.3% of translations for different systems. In ongoing work, AAMT is using UTX-S as the format in building up a user community for producing, sharing, and accumulating user dictionaries in a sustainable way.

2007

pdf
Automatic Evaluation of Machine Translation Based on Rate of Accomplishment of Sub-Goals
Kiyotaka Uchimoto | Katsunori Kotani | Yujie Zhang | Hitoshi Isahara
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

pdf
Minimally Lexicalized Dependency Parsing
Daisuke Kawahara | Kiyotaka Uchimoto
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

pdf
A Hybrid Approach to Word Segmentation and POS Tagging
Tetsuji Nakagawa | Kiyotaka Uchimoto
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

pdf abs
Dependency-structure Annotation to Corpus of Spontaneous Japanese
Kiyotaka Uchimoto | Ryoji Hamabe | Takehiko Maruyama | Katsuya Takanashi | Tatsuya Kawahara | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In Japanese, syntactic structure of a sentence is generally represented by the relationship between phrasal units, or bunsetsus inJapanese, based on a dependency grammar. In the same way, thesyntactic structure of a sentence in a large, spontaneous, Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ), isrepresented by dependency relationships between bunsetsus. This paper describes the criteria and definitions of dependency relationships between bunsetsus in the CSJ. The dependency structure of the CSJ is investigated, and the difference in the dependency structures ofwritten text and spontaneous speech is discussed in terms of thedependency accuracies obtained by using a corpus-based model. It is shown that the accuracy of automatic dependency-structure analysis canbe improved if characteristic phenomena of spontaneous speech such as self-corrections, basic utterance units in spontaneous speech, and bunsetsus that have no modifiee are detected and used for dependency-structure analysis.

pdf abs
Automatic Detection and Semi-Automatic Revision of Non-Machine-Translatable Parts of a Sentence
Kiyotaka Uchimoto | Naoko Hayashida | Toru Ishida | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We developed a method for automatically distinguishing the machine-translatable and non-machine-translatable parts of a given sentence for a particular machine translation (MT) system. They can be distinguished by calculating the similarity between a source-language sentence and its back translation for each part of the sentence. The parts with low similarities are highly likely to be non-machine-translatable parts. We showed that the parts of a sentence that are automatically distinguished as non-machine-translatable provide useful information for paraphrasing or revising the sentence in the source language to improve the quality of the translation by the MT system. We also developed a method of providing knowledge useful to effectively paraphrasing or revising the detected non-machine-translatable parts. Two types of knowledge were extracted from the EDR dictionary: one for transforming a lexical entry into an expression used in the definition and the other for conducting the reverse paraphrasing, which transforms an expression found in a definition into the lexical entry. We found that the information provided by the methods helped improve the machine translatability of the originally input sentences.

pdf
Detection of Quotations and Inserted Clauses and Its Application to Dependency Structure Analysis in Spontaneous Japanese
Ryoji Hamabe | Kiyotaka Uchimoto | Tatsuya Kawahara | Hitoshi Isahara
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

2005

pdf
Building an Annotated Japanese-Chinese Parallel Corpus - A Part of NICT Multilingual Corpora
Yujie Zhang | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf
Error Annotation for Corpus of Japanese Learner English
Emi Izumi | Kiyotaka Uchimoto | Hitoshi Isahara
Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC-2005)

pdf abs
Building an Annotated Japanese-Chinese Parallel Corpus – A Part of NICT Multilingual Corpora
Yujie Zhang | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara
Proceedings of Machine Translation Summit X: Papers

We are constricting a Japanese-Chinese parallel corpus, which is a part of the NICT Multilingual Corpora. The corpus is general domain, of large scale of about 40,000 sentence pairs, long sentences, annotated with detailed information and high quality. To the best of our knowledge, this will be the first annotated Japanese-Chinese parallel corpus in the world. We created the corpus by selecting Japanese sentences from Mainichi Newspaper and then manually translating them into Chinese. We then annotated the corpus with morphological and syntactic structures and alignments at word and phrase levels. This paper describes the specification in human translation and detailed information annotation, and the tools we developed in the project. The experience we obtained and points we paid special attentions are also introduced for share with other researches in corpora construction.

pdf abs
Automatic Rating of Machine Translatability
Kiyotaka Uchimoto | Naoko Hayashida | Toru Ishida | Hitoshi Isahara
Proceedings of Machine Translation Summit X: Papers

We describe a method for automatically rating the machine translatability of a sentence for various machine translation (MT) systems. The method requires that the MT system can bidirectionally translate sentences in both source and target languages. However, it does not require reference translations, as is usual for automatic MT evaluation. By applying this method to every component of a sentence in a given source language, we can automatically identify the machine-translatable and non-machinetranslatable parts of a sentence for a particular MT system. We show that the parts of a sentence that are automatically identified as nonmachine-translatable provide useful information for paraphrasing or revising the sentence in the source language, thus improving the quality of the final translation.

pdf
Analysis of Machine Translation Systems’ Errors in Tense, Aspect, and Modality
Masaki Murata | Kiyotaka Uchimoto | Qing Ma | Toshiyuki Kanamaru | Hitoshi Isahara
Proceedings of the 19th Pacific Asia Conference on Language, Information and Computation

We describe a new model for dependency structure analysis. This model learns the relationship between two phrasal units called bunsetsus as three categories; ‘between’, ‘dependent’, and ‘beyond’, and estimates the dependency likelihood by considering not only the relationship between two bunsetsus but also the relationship between the left bunsetsu and all of the bunsetsus to its right. We implemented this model based on the maximum entropy model. When using the Kyoto University corpus, the dependency accuracy of our model was 88%, which is about 1% higher than that of the conventional model using exactly the same features.

pdf
Named Entity Extraction Based on A Maximum Entropy Model and Transformation Rules
Kiyotaka Uchimoto | Qing Ma | Masaki Murata | Hiromi Ozaku | Hitoshi Isahara
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

pdf
Hybrid Neuro and Rule-Based Part of Speech Taggers
Qing Ma | Masaki Murata | Kiyotaka Uchimoto | Hitoshi Isahara
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf
Bunsetsu Identification Using Category-Exclusive Rules
Masaki Murata | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf
Backward Beam Search Algorithm for Dependency Analysis of Japanese
Satoshi Sekine | Kiyotaka Uchimoto | Hitoshi Isahara
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

pdf
Word Order Acquisition from Corpora
Kiyotaka Uchimoto | Masaki Murata | Qing Ma | Satoshi Sekine | Hitoshi Isahara
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics