Teruaki Oka

2026

JamC-QA: A Multiple-Choice Question Answering Benchmark for Japan-Specific Knowledge
Teruaki Oka | Tomohide Shibata | Nao Yoshida
Proceedings of the Fifteenth Language Resources and Evaluation Conference

We introduce JamC-QA, a multiple-choice question answering benchmark specifically designed to evaluate Japan-specific knowledge. Existing Japanese QA benchmarks largely consist of questions translated from English or derived from professional exams, primarily targeting academic or generally shared knowledge. Consequently, this limits the usefulness of distinguishing the performance of high-performing Large Language Models on local knowledge acquisition. To address this, JamC-QA serves as a robust resource for assessing the acquisition of Japan-specific knowledge. It comprises 2,309 challenging instances that were created entirely from scratch by human annotators across eight categories: culture, custom, regional identity, geography, history, government, law, and healthcare. Instances that were easily answerable by weak models were filtered out. Evaluation results highlight the critical distinction between model types: while multilingual models scored highly on general benchmarks like MMLU and JMMLU, the results on JamC-QA indicate that they do not fully capture Japan-specific knowledge. Japanese-language models outperform multilingual models, especially on culture- and region-related knowledge such as proverbs, traditional events, and local customs. Furthermore, we find a notable division within Japanese models: models further pretrained on Japanese text excel at administrative and legal questions, while models trained from scratch perform strongly on local and cultural aspects.

2024

pdf bib abs

Token-length Bias in Minimal-pair Paradigm Datasets
Naoya Ueda | Masato Mita | Teruaki Oka | Mamoru Komachi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Minimal-pair paradigm datasets have been used as benchmarks to evaluate the linguistic knowledge of models and provide an unsupervised method of acceptability judgment. The model performances are evaluated based on the percentage of minimal pairs in the MPP dataset where the model assigns a higher sentence log-likelihood to an acceptable sentence than to an unacceptable sentence. Each minimal pair in MPP datasets is controlled to align the number of words per sentence because the sentence length affects the sentence log-likelihood. However, aligning the number of words may be insufficient because recent language models tokenize sentences with subwords. Tokenization may cause a token length difference in minimal pairs, introducing token-length bias that skews the evaluation results. This study demonstrates that MPP datasets suffer from token-length bias and fail to evaluate the linguistic knowledge of a language model correctly. The results proved that sentences with a shorter token length would likely be assigned a higher log-likelihood regardless of their acceptability, which becomes problematic when comparing models with different tokenizers. To address this issue, we propose a debiased minimal pair generation method, allowing MPP datasets to measure language ability correctly and provide comparable results for all models.

pdf bib abs

A Document-Level Text Simplification Dataset for Japanese
Yoshinari Nagai | Teruaki Oka | Mamoru Komachi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Document-level text simplification, a task that combines single-document summarization and intra-sentence simplification, has garnered significant attention. However, studies have primarily focused on languages such as English and German, leaving Japanese and similar languages underexplored because of a scarcity of linguistic resources. In this study, we devised JADOS, the first Japanese document-level text simplification dataset based on newspaper articles and Wikipedia. Our dataset focuses on simplification, to enhance readability by reducing the number of sentences and tokens in a document. We conducted investigations using our dataset. Firstly, we analyzed the characteristics of Japanese simplification by comparing it across different domains and with English counterparts. Moreover, we experimentally evaluated the performances of text summarization methods, transformer-based text simplification models, and large language models. In terms of D-SARI scores, the transformer-based models performed best across all domains. Finally, we manually evaluated several model outputs and target articles, demonstrating the need for document-level text simplification models in Japanese.

pdf bib

2023

pdf bib

Construction of Evaluation Dataset for Japanese Lexical Semantic Change Detection
Zhidong Ling | Taichi Aida | Teruaki Oka | Mamoru Komachi
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf bib

Simultaneous Domain Adaptation of Tokenization and Machine Translation
Taisei Enomoto | Tosho Hirasawa | Hwichan Kim | Teruaki Oka | Mamoru Komachi
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

2022

pdf bib abs

Zuo Zhuan Ancient Chinese Dataset for Word Sense Disambiguation
Xiaomeng Pan | Hongfei Wang | Teruaki Oka | Mamoru Komachi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Student Research Workshop

Word Sense Disambiguation (WSD) is a core task in Natural Language Processing (NLP). Ancient Chinese has rarely been used in WSD tasks, however, as no public dataset for ancient Chinese WSD tasks exists. Creation of an ancient Chinese dataset is considered a significant challenge because determining the most appropriate sense in a context is difficult and time-consuming owing to the different usages in ancient and modern Chinese. Actually, no public dataset for ancient Chinese WSD tasks exists. To solve the problem of ancient Chinese WSD, we annotate part of Pre-Qin (221 BC) text Zuo Zhuan using a copyright-free dictionary to create a public sense-tagged dataset. Then, we apply a simple Nearest Neighbors (k-NN) method using a pre-trained language model to the dataset. Our code and dataset will be available on GitHub.

pdf bib

Japanese Named Entity Recognition from Automatic Speech Recognition Using Pre-trained Models
Seiichiro Kondo | Naoya Ueda | Teruaki Oka | Masakazu Sugiyama | Asahi Hentona | Mamoru Komachi
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation

2020

pdf bib abs

The National Institute for Japanese Language and Linguistics, Japan (NINJAL, Japan), has developed several types of corpora. For each corpus NINJAL provided an online search environment, ‘Chunagon’, which is a morphological-information-annotation-based concordance system made publicly available in 2011. NINJAL has now provided a skewer-search system ‘Kotonoha’ based on the ‘Chunagon’ systems. This system enables querying of multiple corpora by certain categories, such as register type and period.

2016

pdf bib abs

Original-Transcribed Text Alignment for Manyosyu Written by Old Japanese Language
Teruaki Oka | Tomoaki Kono
Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

We are constructing an annotated diachronic corpora of the Japanese language. In part of thiswork, we construct a corpus of Manyosyu, which is an old Japanese poetry anthology. In thispaper, we describe how to align the transcribed text and its original text semiautomatically to beable to cross-reference them in our Manyosyu corpus. Although we align the original charactersto the transcribed words manually, we preliminarily align the transcribed and original charactersby using an unsupervised automatic alignment technique of statistical machine translation toalleviate the work. We found that automatic alignment achieves an F1-measure of 0.83; thus, each poem has 1–2 alignment errors. However, finding these errors and modifying them are less workintensiveand more efficient than fully manual annotation. The alignment probabilities can beutilized in this modification. Moreover, we found that we can locate the uncertain transcriptionsin our corpus and compare them to other transcriptions, by using the alignment probabilities.

2011

pdf bib

Automatic Labeling of Voiced Consonants for Morphological Analysis of Modern Japanese Literature
Teruaki Oka | Mamoru Komachi | Toshinobu Ogiso | Yuji Matsumoto
Proceedings of 5th International Joint Conference on Natural Language Processing

Co-authors

Venues

NAACL1

Fix author