Qing Ma


2016

2014

2011

2008

Parallel corpora are critical resources for machine translation research and development since parallel corpora contain translation equivalences of various granularities. Manual annotation of word & phrase alignment is of significance to provide gold-standard for developing and evaluating both example-based machine translation model and statistical machine translation model. This paper presents the work of word & phrase alignment annotation in the NICT Japanese-Chinese parallel corpus, which is constructed at the National Institute of Information and Communications Technology (NICT). We describe the specification of word alignment annotation and the tools specially developed for the manual annotation. The manual annotation on 17,000 sentence pairs has been completed. We examined the manually annotated word alignment data and extracted translation knowledge from the word & phrase aligned corpus.
As a first step to developing systems that enable non-native speakers to output near-perfect English sentences for given mixed English-Japanese sentences, we propose new approaches for selecting English equivalents by using the number of hits for various contexts in large English corpora. As the large English corpora, we not only used the huge amounts of Web data but also the manually compiled large, high-quality English corpora. Using high-quality corpora enables us to accurately select equivalents, and using huge amounts of Web data enables us to resolve the problem of the shortage of hits that normally occurs when using only high-quality corpora. The types and lengths of contexts used to select equivalents are variable and optimally determined according to the number of hits in the corpora, so that performance can be further refined. Computer experiments showed that the precision of our methods was much higher than that of the existing methods for equivalent selection.

2007

2006

Aiming to compile a thesaurus of adjectives, we discuss how to extract abstract nouns categorizing adjectives, clarify the semantic and syntactic functions of these abstract nouns, and manually evaluate the capability to extract the “instance-category” relations. We focused on some Japanese syntactic structures and utilized possibility of omission of abstract noun to decide whether or not a semantic relation between an adjective and an abstract noun is an “instance-category” relation. For 63% of the adjectives (57 groups/90 groups) in our experiments, our extracted categories were found to be most suitable. For 22 % of the adjectives (20/90), the categories in the EDR lexicon were found to be most suitable. For 14% of the adjectives (13/90), neither our extracted categories nor those in EDR were found to be suitable, or examinees’ own categories were considered to be more suitable. From our experimental results, we found that the correspondence between a group of adjectives and their category name was more suitable in our method than in the EDR lexicon.

2005

We are constricting a Japanese-Chinese parallel corpus, which is a part of the NICT Multilingual Corpora. The corpus is general domain, of large scale of about 40,000 sentence pairs, long sentences, annotated with detailed information and high quality. To the best of our knowledge, this will be the first annotated Japanese-Chinese parallel corpus in the world. We created the corpus by selecting Japanese sentences from Mainichi Newspaper and then manually translating them into Chinese. We then annotated the corpus with morphological and syntactic structures and alignments at word and phrase levels. This paper describes the specification in human translation and detailed information annotation, and the tools we developed in the project. The experience we obtained and points we paid special attentions are also introduced for share with other researches in corpora construction.
Automatic word alignment is an important technology for extracting translation knowledge from parallel corpora. However, automatic techniques cannot resolve this problem completely because of variances in translations. We therefore need to investigate the performance potential of automatic word alignment and then decide how to suitably apply it. In this paper we first propose a lexical knowledge-based approach to word alignment on a Japanese-Chinese corpus. Then we evaluate the performance of the proposed approach on the corpus. At the same time we also apply a statistics-based approach, the well-known toolkit GIZA++, to the same test data. Through comparison of the performances of the two approaches, we propose a multi-aligner, exploiting the lexical knowledge-based aligner and the statistics-based aligner at the same time. Quantitative results confirmed the effectiveness of the multi-aligner.

2004

2003

2002

2001

2000

1999

1998