2010
Collection of Usage Information for Language Resources from Academic Articles
Shunsuke Kozawa | Hitomi Tohyama | Kiyotaka Uchimoto | Shigeki Matsubara
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Language resources (LRs) have recently become indispensable for linguistic research. However, existing LRs are often not fully utilized because the variety of their uses is not well known, which suggests that their intrinsic value is not well recognized either. Lists of usage information might improve LR searches and lead to more efficient use. In this research, therefore, we collect a list of usage information for each LR from academic articles to promote the efficient utilization of LRs. This paper proposes constructing a text corpus annotated with usage information (UI corpus). Specifically, we automatically extract sentences containing LR names from academic articles. The extracted sentences are then annotated with usage information by two annotators in a cascaded manner. We show that the UI corpus contributes to efficient LR searches by combining the UI corpus with a metadata database of LRs and comparing the number of LRs retrieved with and without the UI corpus.
2008
Construction of an Infrastructure for Providing Users with Suitable Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Coling 2008: Companion volume: Posters
Automatic Acquisition of Usage Information for Language Resources
Shunsuke Kozawa | Hitomi Tohyama | Kiyotaka Uchimoto | Shigeki Matsubara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Language resources (LRs) have recently become indispensable for linguistic research. Unfortunately, it is not easy to find how they are used by searching the web, even though their uses must be described on the Internet or in academic articles. This indicates that the intrinsic value of LRs is not well recognized. In this research, therefore, we extract a list of usage information for each LR to promote the efficient utilization of LRs. In this paper, we propose a method for extracting a list of usage information from academic articles by using rules based on syntactic information. The rules are generated by focusing on the syntactic features observed in sentences that describe usage information. In experiments, we achieved a recall of 72.9% and a precision of 78.4% on the closed test, and a recall of 60.9% and a precision of 72.7% on the open test.
Construction of a Metadata Database for Efficient Development and Use of Language Resources
Hitomi Tohyama | Shunsuke Kozawa | Kiyotaka Uchimoto | Shigeki Matsubara | Hitoshi Isahara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The National Institute of Information and Communications Technology (NICT) and Nagoya University have been jointly constructing a large-scale database named SHACHI by collecting detailed meta-information on language resources (LRs) in Asia and Western countries, for the purpose of effectively combining LRs. The project aims to investigate the languages, tag sets, and formats compiled in LRs throughout the world, to systematically store LR metadata, to create a search function for this information, and ultimately to use all of this for more efficient development of LRs. The metadata database covers more than 2,000 compiled LRs, such as corpora, dictionaries, thesauruses, and lexicons, forming a large-scale archive of LR metadata. Its metadata, an extended version of the OLAC metadata set conforming to Dublin Core, contains detailed meta-information and has been collected semi-automatically. This paper explains the design and structure of the metadata database, as well as the implementation of the catalogue search tool. The website of this database is now open to the public and accessible to all Internet users.
Construction and Analysis of Word-level Time-aligned Simultaneous Interpretation Corpus
Takahiro Ono | Hitomi Tohyama | Shigeki Matsubara
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper describes quantitative analyses of the delay in Japanese-to-English (J-E) and English-to-Japanese (E-J) interpretations. The Simultaneous Interpretation Database of Nagoya University (SIDB) was used for the analyses. The beginning and end times of each word were added to the corpus using HMM-based phoneme segmentation, and the time lag between corresponding words was calculated as the word-level delay. Word-level delay was calculated for 3,722 pairs and 4,932 pairs of words for J-E and E-J interpretations, respectively. The analyses revealed that J-E interpretation has a much larger delay than E-J interpretation and that the difference in word order between Japanese and English affects the degree of delay.
2006
Collection of Simultaneous Interpreting Patterns by Using Bilingual Spoken Monologue Corpus
Hitomi Tohyama | Shigeki Matsubara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper provides a manual quantitative analysis of the CIAIR simultaneous interpretation corpus and an investigation of simultaneous interpreting patterns using a bilingual spoken monologue corpus. We used 4,578 pairs of English-Japanese aligned utterances from the CIAIR simultaneous interpretation database. This investigation is the largest-scale observation of simultaneous interpreting speech. Simultaneous interpreters are required to generate the target speech simultaneously with the source speech, so they use various strategies to raise simultaneity. In this investigation, simultaneous interpreting patterns with high frequency and high flexibility were extracted from the corpus. As a result, we collected 203 cases among the aligned utterances in which simultaneous interpreters' strategies for raising simultaneity were observed. These 203 cases could be categorized into 12 types of interpreting pattern. We found that 4.5 percent of the English-Japanese monologue data fit these interpreting patterns. These patterns can be expected to serve as interpreting rules for simultaneous machine interpretation.