Zhousi Chen
2026
Evaluation of Document-Level Text Simplification in Japanese
Iori Yamashita | Hikari Tanaka | Hajime Kiyama | Kexin Bian | Zhousi Chen | Mamoru Komachi
Proceedings of the Fifteenth Language Resources and Evaluation Conference
This study establishes an evaluation framework for document-level text simplification in Japanese by constructing a human-annotated dataset and examining the reliability of LLM-based automatic evaluation. We first developed detailed annotation guidelines covering four criteria—necessity, sufficiency, sentence-level simplicity, and document-level simplicity—and collected human ratings for 1,128 source–target document pairs derived from the Wikipedia part of the Japanese simplification corpus JADOS. Using this dataset, we conducted extensive experiments comparing human judgments with evaluations from large language models, including GPT, Claude, and Gemini. The results show that GPT-4o and Gemini 2.5 Pro achieve high agreement with human annotators even in the 0-shot setting, demonstrating their potential as reliable automatic evaluators for Japanese simplification. However, LLMs exhibited a consistent tendency to underestimate document-level simplicity, particularly for kanji-dense texts or texts with relatively long sentences and a small number of sentences. This work provides the first benchmark for evaluating document-level text simplification in Japanese and offers practical evidence that LLM-based evaluation can support scalable assessment for Japanese document-level simplification.
2025
A Fair Comparison without Translationese: English vs. Target-language Instructions for Multilingual LLMs
Taisei Enomoto | Hwichan Kim | Zhousi Chen | Mamoru Komachi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Most large language models are multilingual instruction executors. Prior studies suggested that English instructions are more effective than target-language instructions even for non-English tasks; however, these studies often use datasets and instructions translated from English, which introduce biases known as translationese, hindering an unbiased comparison. To address this issue, we conduct a fair comparison between English and target-language instructions by eliminating translationese effects. Contrary to previous studies, our experiments across several tasks reveal that the advantage of adopting English instructions is not overwhelming. Additionally, we report on the features of generated texts and the instruction-following abilities when using respective instructions.
2024
TMU-HIT’s Submission for the WMT24 Quality Estimation Shared Task: Is GPT-4 a Good Evaluator for Machine Translation?
Ayako Sato | Kyotaro Nakajima | Hwichan Kim | Zhousi Chen | Mamoru Komachi
Proceedings of the Ninth Conference on Machine Translation
In machine translation quality estimation (QE), translation quality is evaluated automatically without the need for reference translations. This paper describes our contribution to the sentence-level subtask of Task 1 at the Ninth Conference on Machine Translation (WMT24), which predicts quality scores for neural MT outputs without reference translations. We fine-tune GPT-4o mini, a large-scale language model (LLM), with limited data for QE. We report results for the direct assessment (DA) method for four language pairs: English-Gujarati (En-Gu), English-Hindi (En-Hi), English-Tamil (En-Ta), and English-Telugu (En-Te). Experiments under zero-shot, few-shot prompting, and fine-tuning settings revealed significantly low performance in the zero-shot setting, while fine-tuning achieved accuracy comparable to last year’s best scores. Our system demonstrated the effectiveness of this approach in low-resource language QE, securing 1st place in both En-Gu and En-Hi, and 4th place in En-Ta and En-Te.
DejaVu: Disambiguation evaluation dataset for English-JApanese machine translation on VisUal information
Ayako Sato | Tosho Hirasawa | Hwichan Kim | Zhousi Chen | Teruaki Oka | Masato Mita | Mamoru Komachi
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation
A Survey for LLM Tuning Methods: Classifying Approaches Based on Model Internal Accessibility
Kyotaro Nakajima | Hwichan Kim | Tosho Hirasawa | Taisei Enomoto | Zhousi Chen | Mamoru Komachi
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation
2023
Query Generation Using GPT-3 for CLIP-Based Word Sense Disambiguation for Image Retrieval
Xiaomeng Pan | Zhousi Chen | Mamoru Komachi
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)
In this study, we propose using GPT-3 as a query generator for the backend of CLIP as an implicit word sense disambiguation (WSD) component for the SemEval 2023 shared task Visual Word Sense Disambiguation (VWSD). We confirmed previous findings: human-like prompts adapted for WSD with quotes benefit both CLIP and GPT-3, whereas plain phrases or poorly templated prompts give the worst results.
Discontinuous Combinatory Constituency Parsing
Zhousi Chen | Mamoru Komachi
Transactions of the Association for Computational Linguistics, Volume 11
We extend a pair of continuous combinator-based constituency parsers (one binary and one multi-branching) into a discontinuous pair. Our parsers iteratively compose constituent vectors from word embeddings without any grammar constraints. Their empirical complexities are subquadratic. Our extension includes 1) a swap action for the orientation-based binary model and 2) biaffine attention for the chunker-based multi-branching model. In tests conducted with the Discontinuous Penn Treebank and TIGER Treebank, we achieved state-of-the-art discontinuous accuracy with a significant speed advantage.