Hitoshi Isahara

2022

2020

pdf
Improving Semantic Similarity Calculation of Japanese Text for MT Evaluation
Yuki Tanahashi | Kyoko Kanzaki | Eiko Yamamoto | Hitoshi Isahara
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

2019

pdf abs
Towards linking synonymous expressions of compound verbs to Japanese WordNet
Kyoko Kanzaki | Hitoshi Isahara
Proceedings of the 10th Global Wordnet Conference

This paper describes our project on Japanese compound verbs. Japanese “Verb (adnominal form) + Verb” compounds, which are treated as single verbs, frequently appear in daily communication. They are not sufficiently registered in Japanese dictionaries or thesauri. We are now compiling a list of the synonymous expressions of compound verbs in the “compound verb lexicon” built by the National Institute of Japanese Language and Linguistics. We extracted synonymous words and phrases of compound verbs from five hundred million Japanese web corpora. As a result, synonymous expressions were obtained automatically for 1800 of the 2700 compound verbs in the “compound verb lexicon”. From our data, we observed that some compound verbs represent not only motion but also additional nuances, such as emotional ones. To reflect the rich meanings that compound verbs carry, we are considering how to link the synonymous expressions to the Japanese WordNet. Concretely, for synonymous phrases, we try to link the adverbial expressions that form part of a phrase to the adverbial synset in the Japanese WordNet.

2018

pdf
Building a List of Synonymous Words and Phrases of Japanese Compound Verbs
Kyoko Kanzaki | Hitoshi Isahara
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

In this paper, we describe the details of ASPEC (Asian Scientific Paper Excerpt Corpus), the first large-scale parallel corpus in the scientific paper domain. ASPEC was constructed in the Japanese-Chinese machine translation project conducted between 2006 and 2010 using the Special Coordination Funds for Promoting Science and Technology. It consists of a Japanese-English scientific paper abstract corpus of approximately 3 million parallel sentences (ASPEC-JE) and a Chinese-Japanese scientific paper excerpt corpus of approximately 0.68 million parallel sentences (ASPEC-JC). ASPEC is used as the official dataset for the machine translation evaluation workshop WAT (Workshop on Asian Translation).

2014

pdf
Fusion of Multiple Semantic Networks and Human Association
Hitoshi Isahara | Kyoko Kanzaki | Eiko Yamamoto | Takayuki Kuribayashi | Michinaga Otsuka
Proceedings of the Seventh Global Wordnet Conference

2012

pdf abs
How Good Is Crowd Post-Editing? Its Potential and Limitations
Midori Tatsumi | Takako Aikawa | Kentaro Yamamoto | Hitoshi Isahara
Workshop on Post-Editing Technology and Practice

This paper is a partial report of a research effort on evaluating the effect of crowd-sourced post-editing. We first discuss the emerging trend of crowd-sourced post-editing of machine translation output, along with its benefits and drawbacks. Second, we describe the pilot study we have conducted on a platform that facilitates crowd-sourced post-editing. Finally, we provide our plans for further studies to have more insight on how effective crowd-sourced post-editing is.

pdf
Building Translation Awareness in Occasional Authors: A User Case from Japan
Midori Tatsumi | Anthony Hartley | Hitoshi Isahara | Kyo Kageura | Toshio Okamoto | Katsumasa Shimizu
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

pdf
Readability and Translatability Judgments for “Controlled Japanese”
Anthony Hartley | Midori Tatsumi | Hitoshi Isahara | Kyo Kageura | Rei Miyata
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

2011

pdf
System for Flexibly Judging the Misuse of Honorifics in Japanese
Tamotsu Shirado | Satoko Marumoto | Masaki Murata | Hitoshi Isahara
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation

pdf
Compiling Learner Corpus Data of Linguistic Output and Language Processing in Speaking, Listening, Writing, and Reading
Katsunori Kotani | Takehiko Yoshimi | Hiroaki Nanjo | Hitoshi Isahara
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

pdf abs
Language Resource Management System for Asian WordNet Collaboration and Its Web Service Application
Virach Sornlertlamvanich | Thatsanee Charoenporn | Hitoshi Isahara
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper presents the language resource management system for the development and dissemination of Asian WordNet (AWN) and its web service application. We develop the platform to establish a network for cross-language WordNet development. Each node of the network is designed for maintaining the WordNet for a language. Via the table that maps between each language WordNet and the Princeton WordNet (PWN), the Asian WordNet is realized to visualize the cross-language WordNet between the Asian languages. We propose a language resource management system, called WordNet Management System (WNMS), as a distributed management system that allows the server to perform cross-language WordNet retrieval, including the fundamental web service applications for editing, visualizing and language processing. The WNMS is implemented on a web service protocol; therefore, each node can be independently maintained, and the service of each language WordNet can be called directly through the web service API. In the case of cross-language implementation, the synset ID (or synset offset) defined by PWN is used to determine the linkage between the languages.

2009

pdf
An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging
Canasai Kruengkrai | Kiyotaka Uchimoto | Jun’ichi Kazama | Yiou Wang | Kentaro Torisawa | Hitoshi Isahara
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

pdf
The “Close-Distant” Relation of Adjectival Concepts Based on Self-Organizing Map
Kyoko Kanzaki | Noriko Tomuro | Hitoshi Isahara
Coling 2008: Proceedings of the Workshop on Cognitive Aspects of the Lexicon (COGALEX 2008)

pdf abs
Boot-Strapping a WordNet Using Multiple Existing WordNets
Francis Bond | Hitoshi Isahara | Kyoko Kanzaki | Kiyotaka Uchimoto
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper we describe the construction of an illustrated Japanese Wordnet. We bootstrap the Wordnet using multiple existing wordnets in order to deal with the ambiguity inherent in translation. We illustrate it with pictures from the Open Clip Art Library.

We outline work performed within the framework of a current EC project. The goal is to construct a language-independent information system for a specific domain (environment/ecology/biodiversity) anchored in a language-independent ontology that is linked to wordnets in seven languages. For each language, information extraction and identification of lexicalized concepts with ontological entries is carried out by text miners (Kybots). The mapping of language-specific lexemes to the ontology allows for crosslinguistic identification and translation of equivalent terms. The infrastructure developed within this project enables long-range knowledge sharing and transfer across many languages and cultures, addressing the need for global and uniform transition of knowledge beyond the specific domains addressed here.

pdf abs
Development of the Japanese WordNet
Hitoshi Isahara | Francis Bond | Kiyotaka Uchimoto | Masao Utiyama | Kyoko Kanzaki
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

After a long history of compilation of our own lexical resources, EDR Japanese/English Electronic Dictionary, and discussions with major players on development of various WordNets, Japanese National Institute of Information and Communications Technology started developing the Japanese WordNet in 2006 and will publicly release the first version, which includes both the synset in Japanese and the annotated Japanese corpus of SemCor, in June 2008. As the first step in compiling the Japanese WordNet, we added Japanese equivalents to synsets of the Princeton WordNet. Of course, we must also add some synsets which do not exist in the Princeton WordNet, and must modify synsets in the Princeton WordNet, in order to make the hierarchical structure of Princeton synsets represent thesaurus-like information found in the Japanese language; however, we will address these tasks in a future study. We then translated English sentences which are used in the SemCor annotation into Japanese and annotated them using our Japanese WordNet. This article gives an overview of our project to compile the Japanese WordNet and other resources which relate to it.

pdf abs
Extraction of Informative Expressions from Domain-specific Documents
Eiko Yamamoto | Hitoshi Isahara | Akira Terada | Yasunori Abe
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

What kinds of lexical resources are helpful for extracting useful information from domain-specific documents? Although domain-specific documents contain much useful knowledge, it is not obvious how to extract such knowledge efficiently from the documents. We need to develop techniques for extracting hidden information from such domain-specific documents. These techniques do not necessarily use state-of-the-art technologies and achieve deep and accurate language understanding, but are based on huge amounts of linguistic resources, such as domain-specific lexical databases. In this paper, we introduce two techniques for extracting informative expressions from documents: the extraction of related words that are not only taxonomically related but also thematically related, and the acquisition of salient terms and phrases. With these techniques we then attempt to automatically and statistically extract domain-specific informative expressions in aviation documents as an example and evaluate the results.

pdf abs
Construction of a Metadata Database for Efficient Development and Use of Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The National Institute of Information and Communications Technology (NICT) and Nagoya University have been jointly constructing a large-scale database named SHACHI by collecting detailed meta-information on language resources (LRs) in Asia and Western countries, for the purpose of effectively combining LRs. The purpose of this project is to investigate languages, tag sets, and formats compiled in LRs throughout the world, to systematically store LR metadata, to create a search function for this information, and to ultimately utilize all this for a more efficient development of LRs. This metadata database contains more than 2,000 compiled LRs such as corpora, dictionaries, thesauruses and lexicons, forming a large-scale archive of LR metadata. Its metadata, an extended version of the OLAC metadata set conforming to Dublin Core, contains detailed meta-information and has been collected semi-automatically. This paper explains the design and the structure of the metadata database, as well as the realization of the catalogue search tool. Additionally, the website of this database is now open to the public and accessible to all Internet users.

We describe recent work on MedSLT, a medium-vocabulary interlingua-based medical speech translation system, focussing on issues that arise when handling languages of which the grammar engineer has little or no knowledge. We show how we can systematically create and maintain multiple forms of grammars, lexica and interlingual representations, with some versions being used by language informants, and some by grammar engineers. In particular, we describe the advantages of structuring the interlingua definition as a simple semantic grammar, which includes a human-readable surface form. We show how this allows us to rationalise the process of evaluating translations between languages lacking common speakers, and also makes it possible to create a simple generic tool for debugging to-interlingua translation rules. Examples presented focus on the concrete case of translation between Japanese and Arabic in both directions.

pdf abs
A Dependency Parser for Thai
Shisanu Tongchim | Randolf Altmeyer | Virach Sornlertlamvanich | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents some preliminary results of our dependency parser for Thai. It is part of an ongoing project in developing a syntactically annotated Thai corpus. The parser has been trained and tested by using the complete part of the corpus. The parser achieves 83.64% as the root accuracy, 78.54% as the dependency accuracy and 53.90% as the complete sentence accuracy. The trained parser will be used as a preprocessing step in our corpus annotation workflow in order to accelerate the corpus development.

pdf abs
Word Alignment Annotation in a Japanese-Chinese Parallel Corpus
Yujie Zhang | Zhulong Wang | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Parallel corpora are critical resources for machine translation research and development since parallel corpora contain translation equivalences of various granularities. Manual annotation of word & phrase alignment is of significance to provide gold-standard for developing and evaluating both example-based machine translation model and statistical machine translation model. This paper presents the work of word & phrase alignment annotation in the NICT Japanese-Chinese parallel corpus, which is constructed at the National Institute of Information and Communications Technology (NICT). We describe the specification of word alignment annotation and the tools specially developed for the manual annotation. The manual annotation on 17,000 sentence pairs has been completed. We examined the manually annotated word alignment data and extracted translation knowledge from the word & phrase aligned corpus.

pdf abs
Selection of Japanese-English Equivalents by Integrating High-quality Corpora and Huge Amounts of Web Data
Qing Ma | Koichi Nakao | Masaki Murata | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

As a first step to developing systems that enable non-native speakers to output near-perfect English sentences for given mixed English-Japanese sentences, we propose new approaches for selecting English equivalents by using the number of hits for various contexts in large English corpora. As the large English corpora, we not only used the huge amounts of Web data but also the manually compiled large, high-quality English corpora. Using high-quality corpora enables us to accurately select equivalents, and using huge amounts of Web data enables us to resolve the problem of the shortage of hits that normally occurs when using only high-quality corpora. The types and lengths of contexts used to select equivalents are variable and optimally determined according to the number of hits in the corpora, so that performance can be further refined. Computer experiments showed that the precision of our methods was much higher than that of the existing methods for equivalent selection.

pdf abs
Application of Resource-based Machine Translation to Real Business Scenes
Hitoshi Isahara | Masao Utiyama | Eiko Yamamoto | Akira Terada | Yasunori Abe
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

As huge quantities of documents have become available, services using natural language processing technologies trained on huge corpora have emerged, such as information retrieval and information extraction. In this paper we verify the usefulness of resource-based, or corpus-based, translation in the aviation domain as a real business situation. This study is important from both a business perspective and an academic perspective. Intuitively, manuals for similar products, or manuals for different versions of the same product, are likely to resemble each other. Therefore, even with only a small amount of training data, a corpus-based MT system can output useful translations. The corpus-based approach is powerful when the target is repetitive. Manuals for similar products, or manuals for different versions of the same product, are real-world documents that are repetitive. Our experiments on translation of manual documents are still at an early stage. However, the BLEU score from a very small number of training sentences is already rather high. We believe corpus-based machine translation is a promising approach in this kind of actual business scene.

pdf abs
Extraction of Attribute Concepts from Japanese Adjectives
Kyoko Kanzaki | Francis Bond | Noriko Tomuro | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We describe various syntactic and semantic conditions for finding abstract nouns which refer to concepts of adjectives from a text, in an attempt to explore the creation of a thesaurus from text. Depending on usages, six kinds of syntactic patterns are shown. In the syntactic and semantic conditions an omission of an abstract noun is mainly used, but in addition, various linguistic clues are needed. We then compare our results with synsets of Japanese WordNet. From a viewpoint of Japanese WordNet, the degree of agreement of “Attribute” between our data and Japanese WordNet was 22%. On the other hand, the total number of differences of obtained abstract nouns was 267. From a viewpoint of our data, the degree of agreement of abstract nouns between our data and Japanese WordNet was 54%.

pdf
Dependency Parsing with Short Dependency Relations in Unlabeled Data
Wenliang Chen | Daisuke Kawahara | Kiyotaka Uchimoto | Yujie Zhang | Hitoshi Isahara
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf
Hypothesis Selection in Machine Transliteration: A Web Mining Approach
Jong-Hoon Oh | Hitoshi Isahara
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf
Synset Assignment for Bi-lingual Dictionary with Limited Resource
Virach Sornlertlamvanich | Thatsanee Charoenporn | Chumpol Mokarat | Hitoshi Isahara | Hammam Riza | Purev Jaimai
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf
Non-Factoid Japanese Question Answering through Passage Retrieval that Is Weighted Based on Types of Answers
Masaki Murata | Sachiyo Tsukawaki | Toshiyuki Kanamaru | Qing Ma | Hitoshi Isahara
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-II

pdf
KUI: an ubiquitous tool for collective intelligence development
Thatsanee Charoenporn | Virach Sornlertlamvanich | Hitoshi Isahara | Kergrit Robkop
Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages

pdf
Enhanced Tools for Online Collaborative Language Resource Development
Virach Sornlertlamvanich | Thatsanee Charoenporn | Suphanut Thayaboon | Chumpol Mokarat | Hitoshi Isahara
Proceedings of the 6th Workshop on Asian Language Resources

pdf abs
Applicability of Resource-based Machine Translation to Airplane Manuals
Eiko Yamamoto | Akira Terada | Hitoshi Isahara
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Government and Commercial Uses of MT

Machine translation (MT) has been studied and developed since the advent of computers, and yet is rarely used in actual business. For business use, rule-based MT has been developed, but it requires rules and a domain-specific dictionary that have been created manually. On the other hand, as huge amounts of text data have become available, corpus-based MT has been actively studied, particularly corpus-based statistical machine translation (SMT). In this study, we tested and verified the usefulness of SMT for aviation manuals. Manuals tend to be similar and repetitive, so SMT is powerful even with a small amount of training data. Although our experiments with SMT are at the preliminary stage, the BLEU score is high. SMT appears to be a powerful and promising technique in this domain.

pdf
Learning Reliable Information for Dependency Parsing Adaptation
Wenliang Chen | Youzheng Wu | Hitoshi Isahara
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf
Construction of an Infrastructure for Providing Users with Suitable Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Coling 2008: Companion volume: Posters

pdf
Experiments in Base-NP Chunking and Its Role in Dependency Parsing for Thai
Shisanu Tongchim | Virach Sornlertlamvanich | Hitoshi Isahara
Coling 2008: Companion volume: Posters

2007

pdf
Automatic Evaluation of Machine Translation Based on Rate of Accomplishment of Sub-Goals
Kiyotaka Uchimoto | Katsunori Kotani | Yujie Zhang | Hitoshi Isahara
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

pdf
A Comparison of Pivot Methods for Phrase-Based Statistical Machine Translation
Masao Utiyama | Hitoshi Isahara
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference

pdf
Japanese Expressions that Include English Expressions
Masaki Murata | Toshiyuki Kanamaru | Koichiro Nakamoto | Katsunori Kotani | Hitoshi Isahara
Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation

pdf
A Two-Stage Parser for Multilingual Dependency Parsing
Wenliang Chen | Yujie Zhang | Hitoshi Isahara
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

pdf
Machine transliteration using multiple transliteration engines and hypothesis re-ranking
Jong-Hoon Oh | Hitoshi Isahara
Proceedings of Machine Translation Summit XI: Papers

pdf
A Japanese-English patent parallel corpus
Masao Utiyama | Hitoshi Isahara
Proceedings of Machine Translation Summit XI: Papers

pdf
Building Japanese-Chinese translation dictionary based on EDR Japanese-English bilingual dictionary
Yujie Zhang | Qing Ma | Hitoshi Isahara
Proceedings of Machine Translation Summit XI: Papers

pdf
Extracting Word Sets with Non-Taxonomical Relation
Eiko Yamamoto | Hitoshi Isahara
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions

2006

pdf
An Empirical Study of Chinese Chunking
Wenliang Chen | Yujie Zhang | Hitoshi Isahara
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf
Detection of Quotations and Inserted Clauses and Its Application to Dependency Structure Analysis in Spontaneous Japanese
Ryoji Hamabe | Kiyotaka Uchimoto | Tatsuya Kawahara | Hitoshi Isahara
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf
Machine-Learning-Based Transformation of Passive Japanese Sentences into Active by Separating Training Data into Each Input Particle
Masaki Murata | Toshiyuki Kanamaru | Tamotsu Shirado | Hitoshi Isahara
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions

pdf abs
Blind Evaluation for Thai Search Engines
Shisanu Tongchim | Prapass Srichaivattana | Virach Sornlertlamvanich | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper compares the effectiveness of two different Thai search engines by using a blind evaluation. The probabilistic-based dictionary-less search engine is evaluated against the traditional word-based indexing method. The web documents from 12 Thai newspaper web sites consisting of 83,453 documents are used as the test collection. The relevance judgment is conducted on the first five returned results from each system. The evaluation process is completely blind. That is, the retrieved documents from both systems are shown to the judges without any information about the search techniques. Statistical testing shows that the dictionary-less approach is better than the word-based indexing approach in terms of the number of found documents and the number of relevant documents.

pdf abs
A Conditional Random Field Framework for Thai Morphological Analysis
Canasai Kruengkrai | Virach Sornlertlamvanich | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper presents a framework for Thai morphological analysis based on the theoretical background of conditional random fields. We formulate morphological analysis of an unsegmented language as the sequential supervised learning problem. Given a sequence of characters, all possibilities of word/tag segmentation are generated, and then the optimal path is selected with some criterion. We examine two different techniques, including the Viterbi score and the confidence estimation. Preliminary results are given to show the feasibility of our proposed framework.

pdf abs
Dependency-structure Annotation to Corpus of Spontaneous Japanese
Kiyotaka Uchimoto | Ryoji Hamabe | Takehiko Maruyama | Katsuya Takanashi | Tatsuya Kawahara | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In Japanese, syntactic structure of a sentence is generally represented by the relationship between phrasal units, or bunsetsus in Japanese, based on a dependency grammar. In the same way, the syntactic structure of a sentence in a large, spontaneous, Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ), is represented by dependency relationships between bunsetsus. This paper describes the criteria and definitions of dependency relationships between bunsetsus in the CSJ. The dependency structure of the CSJ is investigated, and the difference in the dependency structures of written text and spontaneous speech is discussed in terms of the dependency accuracies obtained by using a corpus-based model. It is shown that the accuracy of automatic dependency-structure analysis can be improved if characteristic phenomena of spontaneous speech such as self-corrections, basic utterance units in spontaneous speech, and bunsetsus that have no modifiee are detected and used for dependency-structure analysis.

pdf abs
Creation of a Japanese Adverb Dictionary that Includes Information on the Speaker’s Communicative Intention Using Machine Learning
Toshiyuki Kanamaru | Masaki Murata | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Japanese adverbs are classified as either declarative or normal; the former declare the communicative intention of the speaker, while the latter convey a manner of action, a quantity, or a degree by which the adverb modifies the verb or adjective that it accompanies. We have automatically classified adverbs as either declarative or not declarative using a machine-learning method such as the maximum entropy method. We defined adverbs having positive or negative connotations as the positive data. We classified adverbs in the EDR dictionary and IPADIC used by Chasen using this result and built an adverb dictionary that contains descriptions of the communicative intentions of the speaker.

pdf abs
Automatic Detection and Semi-Automatic Revision of Non-Machine-Translatable Parts of a Sentence
Kiyotaka Uchimoto | Naoko Hayashida | Toru Ishida | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We developed a method for automatically distinguishing the machine-translatable and non-machine-translatable parts of a given sentence for a particular machine translation (MT) system. They can be distinguished by calculating the similarity between a source-language sentence and its back translation for each part of the sentence. The parts with low similarities are highly likely to be non-machine-translatable parts. We showed that the parts of a sentence that are automatically distinguished as non-machine-translatable provide useful information for paraphrasing or revising the sentence in the source language to improve the quality of the translation by the MT system. We also developed a method of providing knowledge useful to effectively paraphrasing or revising the detected non-machine-translatable parts. Two types of knowledge were extracted from the EDR dictionary: one for transforming a lexical entry into an expression used in the definition and the other for conducting the reverse paraphrasing, which transforms an expression found in a definition into the lexical entry. We found that the information provided by the methods helped improve the machine translatability of the originally input sentences.

pdf abs
Semantic Analysis of Abstract Nouns to Compile a Thesaurus of Adjectives
Kyoko Kanzaki | Qing Ma | Eiko Yamamoto | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Aiming to compile a thesaurus of adjectives, we discuss how to extract abstract nouns categorizing adjectives, clarify the semantic and syntactic functions of these abstract nouns, and manually evaluate the capability to extract the instance-category relations. We focused on some Japanese syntactic structures and used the possibility of omitting an abstract noun to decide whether or not a semantic relation between an adjective and an abstract noun is an instance-category relation. For 63% of the adjectives (57 groups/90 groups) in our experiments, our extracted categories were found to be most suitable. For 22% of the adjectives (20/90), the categories in the EDR lexicon were found to be most suitable. For 14% of the adjectives (13/90), neither our extracted categories nor those in EDR were found to be suitable, or the examinees' own categories were considered to be more suitable. From our experimental results, we found that the correspondence between a group of adjectives and their category name was more suitable in our method than in the EDR lexicon.

pdf abs
Getting Deeper Semantics than Berkeley FrameNet with MSFA
Kow Kuroda | Masao Utiyama | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper illustrates relevant details of an on-going semantic-role annotation work based on a framework called MULTILAYERED/DIMENSIONAL SEMANTIC FRAME ANALYSIS (MSFA for short) (Kuroda and Isahara, 2005b), which is inspired by, if not derived from, Frame Semantics/Berkeley FrameNet approach to semantic annotation (Lowe et al., 1997; Johnson and Fillmore, 2000).

pdf abs
Word Knowledge Acquisition for Computational Lexicon Construction
Thatsanee Charoenporn | Canasai Kruengkrai | Thanaruk Theeramunkong | Virach Sornlertlamvanich | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The growth of multilingual information processing technology has created a need for linguistic resources, especially lexical databases. Many attempts have been made to convert the traditional dictionary into a computational dictionary, widely known as a computational lexicon. TCL's Computational Lexicon (TCLLEX) is a recent development of a large-scale Thai Lexicon, which aims to serve as a fundamental linguistic resource for natural language processing research. We design either terminology or ontology for structuring the lexicon based on the idea of computability and reusability.

pdf abs
Detection of inconsistencies in concept classifications in a large dictionary — Toward an improvement of the EDR electronic dictionary —
Eiko Yamamoto | Kyoko Kanzaki | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The EDR electronic dictionary is a machine-tractable dictionary developed for advanced computer-based processing of natural language. This dictionary comprises eleven sub-dictionaries, including a concept dictionary, word dictionaries, bilingual dictionaries, co-occurrence dictionaries, and a technical terminology dictionary. In this study, we focus on the concept dictionary and aim to revise the arrangement of concepts for improving the EDR electronic dictionary. We believe that unsuitable concepts in a class differ from other concepts in the same class from an abstract perspective. From this notion, we first try to automatically extract those concepts unsuited to the class. We then try semi-automatically to amend the concept explications used to explain the meanings to human users and rearrange them in suitable classes. In the experiment, we try to revise those concepts that are the lower-concepts of the concept human in the concept hierarchy and that are directly arranged under concepts with concept explications such as person as defined by and person viewed from . We analyze the result and evaluate our approach.

pdf
Construction of Adverb Dictionary that Relates to Speaker Attitudes and Evaluation of Its Effectiveness
Toshiyuki Kanamaru | Masaki Murata | Hitoshi Isahara
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation

pdf
Chinese Named Entity Recognition with Conditional Random Fields
Wenliang Chen | Yujie Zhang | Hitoshi Isahara
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing

2005

pdf
Organizing English Reading Materials for Vocabulary Learning
Masao Utiyama | Midori Tanimura | Hitoshi Isahara
Proceedings of the ACL Interactive Poster and Demonstration Sessions

pdf
Analysis of an Iterative Algorithm for Term-Based Ontology Alignment
Shisanu Tongchim | Canasai Kruengkrai | Virach Sornlertlamvanich | Prapass Srichaivattana | Hitoshi Isahara
Second International Joint Conference on Natural Language Processing: Full Papers

pdf
A System to Solve Language Tests for Second Grade Students
Manami Saito | Kazuhide Yamamoto | Satoshi Sekine | Hitoshi Isahara
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf
Building an Annotated Japanese-Chinese Parallel Corpus - A Part of NICT Multilingual Corpora
Yujie Zhang | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf
Information Retrieval Capable of Visualization and High Precision
Qing Ma | Kousuke Enomoto | Masaki Murata | Hitoshi Isahara
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf
Toward a Unified Evaluation Method for Multiple Reading Support Systems: A Reading Speed-based Procedure
Katsunori Kotani | Takehiko Yoshimi | Takeshi Kutsumi | Ichiko Sata | Hitoshi Isahara
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf
Trend Survey on Japanese Natural Language Processing Studies over the Last Decade
Masaki Murata | Koji Ichii | Qing Ma | Tamotsu Shirado | Toshiyuki Kanamaru | Hitoshi Isahara
Companion Volume to the Proceedings of Conference including Posters/Demos and tutorial abstracts

pdf bib
Obtaining Japanese Lexical Units for Semantic Frames from Berkeley FrameNet Using a Bilingual Corpus
Toshiyuki Kanamaru | Masaki Murata | Kow Kuroda | Hitoshi Isahara
Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC-2005)

pdf
Error Annotation for Corpus of Japanese Learner English
Emi Izumi | Kiyotaka Uchimoto | Hitoshi Isahara
Proceedings of the Sixth International Workshop on Linguistically Interpreted Corpora (LINC-2005)

Since 1994, China’s HTRDP machine translation evaluation has been conducted five times. Systems for various translation directions among Chinese, English, Japanese and French have been tested. Both human evaluation and automatic evaluation are conducted in the HTRDP evaluation. In recent years, the evaluation has been organized jointly with NICT of Japan. This paper introduces some details of this evaluation.

pdf bib abs
Selection of Entries for a Bilingual Dictionary from Aligned Translation Equivalents using Support Vector Machines
Takeshi Kutsumi | Takehiko Yoshimi | Katsunori Kotani | Ichiko Sata | Hitoshi Isahara
Proceedings of Machine Translation Summit X: Papers

This paper claims that constructing a dictionary using bilingual pairs obtained from parallel corpora requires not only correct alignment of two noun phrases but also judgment of each pair's appropriateness as an entry. It specifically addresses the latter task, which has received little attention. It demonstrates a method of selecting a suitable entry using Support Vector Machines, and proposes using as features the common and the different parts between a current translation and a new translation. Using experimental results, this paper examines how selection performance is affected by the four ways of representing the common and the different parts: morphemes, parts of speech, semantic markers, and upper-level semantic markers. Moreover, we used n-grams of the common and the different parts for the above four kinds of features. The experimental results showed that representation by morphemes achieved the best performance, an F-measure of 0.803.

pdf abs
Building an Annotated Japanese-Chinese Parallel Corpus – A Part of NICT Multilingual Corpora
Yujie Zhang | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara
Proceedings of Machine Translation Summit X: Papers

We are constructing a Japanese-Chinese parallel corpus, which is a part of the NICT Multilingual Corpora. The corpus is general-domain and large-scale, with about 40,000 sentence pairs; it contains long sentences, is annotated with detailed information, and is of high quality. To the best of our knowledge, this will be the first annotated Japanese-Chinese parallel corpus in the world. We created the corpus by selecting Japanese sentences from Mainichi Newspaper and then manually translating them into Chinese. We then annotated the corpus with morphological and syntactic structures and alignments at word and phrase levels. This paper describes the specifications for human translation and detailed information annotation, and the tools we developed in the project. The experience we obtained and the points to which we paid special attention are also introduced, to share with other researchers in corpus construction.

pdf abs
A Multi-aligner for Japanese-Chinese Parallel Corpora
Yujie Zhang | Qun Liu | Qing Ma | Hitoshi Isahara
Proceedings of Machine Translation Summit X: Papers

Automatic word alignment is an important technology for extracting translation knowledge from parallel corpora. However, automatic techniques cannot resolve this problem completely because of variances in translations. We therefore need to investigate the performance potential of automatic word alignment and then decide how to suitably apply it. In this paper we first propose a lexical knowledge-based approach to word alignment on a Japanese-Chinese corpus. Then we evaluate the performance of the proposed approach on the corpus. At the same time we also apply a statistics-based approach, the well-known toolkit GIZA++, to the same test data. Through comparison of the performances of the two approaches, we propose a multi-aligner, exploiting the lexical knowledge-based aligner and the statistics-based aligner at the same time. Quantitative results confirmed the effectiveness of the multi-aligner.

In this paper, we present evidence that providing users of a speech to speech translation system for emergency diagnosis (MedSLT) with a tool that helps them to learn the coverage greatly improves their success in using the system. In MedSLT, the system uses a grammar-based recogniser that provides more predictable results to the translation component. The help module aims at addressing the lack of robustness inherent in this type of approach. It takes as input the result of a robust statistical recogniser that performs better for out-of-coverage data and produces a list of in-coverage example sentences. These examples are selected from a defined list using a heuristic that prioritises sentences maximising the number of N-grams shared with those extracted from the recognition result.
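
The example-selection heuristic described in this abstract can be illustrated with a minimal sketch. It assumes whitespace tokenisation, unigram and bigram overlap, and a plain count of shared N-grams as the score; the function names (`ngrams`, `shared_ngram_count`, `suggest_examples`) and the sample sentences are hypothetical and are not taken from MedSLT.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams (as tuples) found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def shared_ngram_count(candidate, recognised, max_n=2):
    """Count n-grams of orders 1..max_n shared by two sentences.

    Assumption: sentences are lower-cased and split on whitespace.
    """
    cand = candidate.lower().split()
    rec = recognised.lower().split()
    total = 0
    for n in range(1, max_n + 1):
        # Counter intersection keeps the minimum count of each shared n-gram.
        total += sum((ngrams(cand, n) & ngrams(rec, n)).values())
    return total


def suggest_examples(recognised, in_coverage, k=3, max_n=2):
    """Return the k in-coverage sentences sharing the most n-grams with the input."""
    ranked = sorted(
        in_coverage,
        key=lambda s: shared_ngram_count(s, recognised, max_n),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical in-coverage list and statistical recogniser output.
    coverage = [
        "do you have a fever",
        "is the pain severe",
        "does the pain spread to the neck",
    ]
    print(suggest_examples("the pain is spreading to my neck", coverage, k=2))
```

A real system would likely weight longer N-grams more heavily or normalise by sentence length; the plain count here is only the simplest reading of "maximising the number of N-grams shared".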

pdf abs
Automatic Rating of Machine Translatability
Kiyotaka Uchimoto | Naoko Hayashida | Toru Ishida | Hitoshi Isahara
Proceedings of Machine Translation Summit X: Papers

We describe a method for automatically rating the machine translatability of a sentence for various machine translation (MT) systems. The method requires that the MT system can bidirectionally translate sentences in both source and target languages. However, it does not require reference translations, as is usual for automatic MT evaluation. By applying this method to every component of a sentence in a given source language, we can automatically identify the machine-translatable and non-machine-translatable parts of a sentence for a particular MT system. We show that the parts of a sentence that are automatically identified as non-machine-translatable provide useful information for paraphrasing or revising the sentence in the source language, thus improving the quality of the final translation.

pdf
Analysis of Machine Translation Systems’ Errors in Tense, Aspect, and Modality
Masaki Murata | Kiyotaka Uchimoto | Qing Ma | Toshiyuki Kanamaru | Hitoshi Isahara
Proceedings of the 19th Pacific Asia Conference on Language, Information and Computation

We describe a new model for dependency structure analysis. This model learns the relationship between two phrasal units called bunsetsus as one of three categories: ‘between’, ‘dependent’, and ‘beyond’. It estimates the dependency likelihood by considering not only the relationship between two bunsetsus but also the relationship between the left bunsetsu and all of the bunsetsus to its right. We implemented this model based on the maximum entropy model. When using the Kyoto University corpus, the dependency accuracy of our model was 88%, which is about 1% higher than that of the conventional model using exactly the same features.

pdf
Hybrid Neuro and Rule-Based Part of Speech Taggers
Qing Ma | Masaki Murata | Kiyotaka Uchimoto | Hitoshi Isahara
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf
Bunsetsu Identification Using Category-Exclusive Rules
Masaki Murata | Kiyotaka Uchimoto | Qing Ma | Hitoshi Isahara
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics

pdf
Backward Beam Search Algorithm for Dependency Analysis of Japanese
Satoshi Sekine | Kiyotaka Uchimoto | Hitoshi Isahara
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

pdf
Word Order Acquisition from Corpora
Kiyotaka Uchimoto | Masaki Murata | Qing Ma | Satoshi Sekine | Hitoshi Isahara
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

pdf
A Statistical Approach to the Processing of Metonymy
Masao Utiyama | Masaki Murata | Hitoshi Isahara
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics

1999

pdf
Japanese Dependency Structure Analysis Based on Maximum Entropy Models
Kiyotaka Uchimoto | Satoshi Sekine | Hitoshi Isahara
Ninth Conference of the European Chapter of the Association for Computational Linguistics

pdf
Resolution of Indirect Anaphora in Japanese Sentences Using Examples: “X no Y (Y of X)”
Masaki Murata | Hitoshi Isahara | Makoto Nagao
Coreference and Its Applications

pdf
Pronoun Resolution in Japanese Sentences Using Surface Expressions and Examples
Masaki Murata | Hitoshi Isahara | Makoto Nagao
Coreference and Its Applications

pdf
An example-based approach to Japanese-to-English translation of tense, aspect, and modality
Masaki Murata | Qing Ma | Kiyotaka Uchimoto | Hitoshi Isahara
Proceedings of the 8th Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

pdf
Lexical Semantics to Disambiguate Polysemous Phenomena of Japanese Adnominal Constituents
Hitoshi Isahara | Kyoko Kanzaki
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

1998

pdf
A Multi-Neuro Tagger Using Variable Lengths of Contexts
Qing Ma | Hitoshi Isahara
COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics

pdf bib
Intelligent Network News Reader with Visual User Interface
Hitoshi Isahara | Kiyotaka Uchimoto | Hiromi Ozaku
Content Visualization and Intermedia Representations (CVIR’98)

1997

pdf abs
JEIDA’s Bilingual Corpus and Other Corpora for NLP Research in Japan
Hitoshi Isahara
Proceedings of Machine Translation Summit VI: Papers

The committee on text processing technology of JEIDA (Japan Electronics Industry Development Association) has been developing its bilingual corpus for research on machine translation systems since the 1996 Japanese fiscal year. An overview of this bilingual corpus is presented in this paper, along with other linguistic data recently developed in Japan, including the RWC text database and the simple sentence data by the CRL and IPA.