2024
A Document-Level Text Simplification Dataset for Japanese
Yoshinari Nagai | Teruaki Oka | Mamoru Komachi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Document-level text simplification, a task that combines single-document summarization and intra-sentence simplification, has garnered significant attention. However, studies have primarily focused on languages such as English and German, leaving Japanese and similar languages underexplored because of a scarcity of linguistic resources. In this study, we constructed JADOS, the first Japanese document-level text simplification dataset, based on newspaper articles and Wikipedia. Our dataset focuses on simplification that enhances readability by reducing the number of sentences and tokens in a document. We conducted several investigations using our dataset. First, we analyzed the characteristics of Japanese simplification by comparing it across different domains and with English counterparts. Moreover, we experimentally evaluated the performance of text summarization methods, transformer-based text simplification models, and large language models. In terms of D-SARI scores, the transformer-based models performed best across all domains. Finally, we manually evaluated several model outputs and target articles, demonstrating the need for document-level text simplification models in Japanese.
Token-length Bias in Minimal-pair Paradigm Datasets
Naoya Ueda | Masato Mita | Teruaki Oka | Mamoru Komachi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Minimal-pair paradigm (MPP) datasets have been used as benchmarks to evaluate the linguistic knowledge of models and to provide an unsupervised method of acceptability judgment. Model performance is evaluated by the percentage of minimal pairs in the MPP dataset for which the model assigns a higher sentence log-likelihood to the acceptable sentence than to the unacceptable one. Each minimal pair in MPP datasets is controlled to align the number of words per sentence because sentence length affects the sentence log-likelihood. However, aligning the number of words may be insufficient because recent language models tokenize sentences into subwords. Tokenization may cause a token-length difference within minimal pairs, introducing a token-length bias that skews the evaluation results. This study demonstrates that MPP datasets suffer from token-length bias and fail to evaluate the linguistic knowledge of a language model correctly. The results showed that sentences with a shorter token length are likely to be assigned a higher log-likelihood regardless of their acceptability, which becomes problematic when comparing models with different tokenizers. To address this issue, we propose a debiased minimal-pair generation method, allowing MPP datasets to measure language ability correctly and provide comparable results for all models.
2023
Simultaneous Domain Adaptation of Tokenization and Machine Translation
Taisei Enomoto | Tosho Hirasawa | Hwichan Kim | Teruaki Oka | Mamoru Komachi
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
Construction of Evaluation Dataset for Japanese Lexical Semantic Change Detection
Zhidong Ling | Taichi Aida | Teruaki Oka | Mamoru Komachi
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
2022
Japanese Named Entity Recognition from Automatic Speech Recognition Using Pre-trained Models
Seiichiro Kondo | Naoya Ueda | Teruaki Oka | Masakazu Sugiyama | Asahi Hentona | Mamoru Komachi
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation
Zuo Zhuan Ancient Chinese Dataset for Word Sense Disambiguation
Xiaomeng Pan | Hongfei Wang | Teruaki Oka | Mamoru Komachi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop
Word Sense Disambiguation (WSD) is a core task in Natural Language Processing (NLP). However, Ancient Chinese has rarely been used in WSD tasks, as no public dataset for ancient Chinese WSD exists. Creating an ancient Chinese dataset is considered a significant challenge because determining the most appropriate sense in a context is difficult and time-consuming, owing to the different usages in ancient and modern Chinese. To address this problem, we annotate part of the Pre-Qin (221 BC) text Zuo Zhuan using a copyright-free dictionary to create a public sense-tagged dataset. Then, we apply a simple k-Nearest Neighbors (k-NN) method using a pre-trained language model to the dataset. Our code and dataset will be available on GitHub.
2020
KOTONOHA: A Corpus Concordance System for Skewer-Searching NINJAL Corpora
Teruaki Oka | Yuichi Ishimoto | Yutaka Yagi | Takenori Nakamura | Masayuki Asahara | Kikuo Maekawa | Toshinobu Ogiso | Hanae Koiso | Kumiko Sakoda | Nobuko Kibe
Proceedings of the Twelfth Language Resources and Evaluation Conference
The National Institute for Japanese Language and Linguistics (NINJAL), Japan, has developed several types of corpora. For each corpus, NINJAL has provided an online search environment, ‘Chunagon’, a concordance system based on morphological-information annotation that was made publicly available in 2011. NINJAL has now developed a skewer-search system, ‘Kotonoha’, built on the ‘Chunagon’ systems. This system enables querying multiple corpora by categories such as register type and period.
2016
Original-Transcribed Text Alignment for Manyosyu Written by Old Japanese Language
Teruaki Oka | Tomoaki Kono
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)
We are constructing annotated diachronic corpora of the Japanese language. As part of this work, we construct a corpus of Manyosyu, which is an old Japanese poetry anthology. In this paper, we describe how to align the transcribed text and its original text semiautomatically so that they can be cross-referenced in our Manyosyu corpus. Although we align the original characters to the transcribed words manually, we preliminarily align the transcribed and original characters using an unsupervised automatic alignment technique from statistical machine translation to alleviate the work. We found that automatic alignment achieves an F1-measure of 0.83; thus, each poem has 1–2 alignment errors. However, finding and modifying these errors is less work-intensive and more efficient than fully manual annotation, and the alignment probabilities can be utilized in this modification. Moreover, we found that by using the alignment probabilities, we can locate uncertain transcriptions in our corpus and compare them to other transcriptions.
2011
Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature
Teruaki Oka | Mamoru Komachi | Toshinobu Ogiso | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing