2022
Zuo Zhuan Ancient Chinese Dataset for Word Sense Disambiguation
Xiaomeng Pan
|
Hongfei Wang
|
Teruaki Oka
|
Mamoru Komachi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
Word Sense Disambiguation (WSD) is a core task in Natural Language Processing (NLP). Ancient Chinese has rarely been used in WSD tasks, however, because no public dataset for ancient Chinese WSD exists. Creating such a dataset is a significant challenge: usage differs between ancient and modern Chinese, so determining the most appropriate sense in a given context is difficult and time-consuming. To address the problem of ancient Chinese WSD, we annotate part of the Pre-Qin (before 221 BC) text Zuo Zhuan using a copyright-free dictionary to create a public sense-tagged dataset. Then, we apply a simple k-Nearest Neighbors (k-NN) method using a pre-trained language model to the dataset. Our code and dataset will be available on GitHub.
2020
KOTONOHA: A Corpus Concordance System for Skewer-Searching NINJAL Corpora
Teruaki Oka
|
Yuichi Ishimoto
|
Yutaka Yagi
|
Takenori Nakamura
|
Masayuki Asahara
|
Kikuo Maekawa
|
Toshinobu Ogiso
|
Hanae Koiso
|
Kumiko Sakoda
|
Nobuko Kibe
Proceedings of the Twelfth Language Resources and Evaluation Conference
The National Institute for Japanese Language and Linguistics, Japan (NINJAL, Japan), has developed several types of corpora. For each corpus, NINJAL has provided an online search environment, ‘Chunagon’, a concordance system based on morphological-information annotation that was made publicly available in 2011. NINJAL has now provided a skewer-search system, ‘Kotonoha’, based on the ‘Chunagon’ systems. This system enables querying of multiple corpora by certain categories, such as register type and period.
2016
Original-Transcribed Text Alignment for Manyosyu Written by Old Japanese Language
Teruaki Oka
|
Tomoaki Kono
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
We are constructing an annotated diachronic corpus of the Japanese language. As part of this work, we construct a corpus of Manyosyu, which is an old Japanese poetry anthology. In this paper, we describe how to align the transcribed text and its original text semiautomatically to be able to cross-reference them in our Manyosyu corpus. Although we align the original characters to the transcribed words manually, we preliminarily align the transcribed and original characters by using an unsupervised automatic alignment technique of statistical machine translation to alleviate the work. We found that automatic alignment achieves an F1-measure of 0.83; thus, each poem has 1–2 alignment errors. However, finding these errors and modifying them is less work-intensive and more efficient than fully manual annotation. The alignment probabilities can be utilized in this modification. Moreover, we found that we can locate the uncertain transcriptions in our corpus and compare them to other transcriptions by using the alignment probabilities.
2011
Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature
Teruaki Oka
|
Mamoru Komachi
|
Toshinobu Ogiso
|
Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing