Shengzhe Li


2026

We present JMTEB, a large-scale evaluation suite for Japanese text embedding models, designed to provide comprehensive coverage across multiple task types. The benchmark integrates 28 datasets across 5 tasks, enabling broad and challenging evaluation of model performance in diverse scenarios. While the full benchmark delivers thorough assessment, its scale poses practical challenges in terms of computation time and resource requirements. To address this, we construct JMTEB-lite, a lightweight version of JMTEB, by substantially reducing corpus size in retrieval-related tasks. JMTEB-lite significantly accelerates evaluation while maintaining high fidelity to the full benchmark. Together, JMTEB and JMTEB-lite form a flexible evaluation framework: the full version serves as a comprehensive standard for exhaustive benchmarking, while the lightweight version enables rapid iteration and efficient model selection. This dual approach facilitates both rigorous evaluation and practical development workflows, supporting the advancement of Japanese text embedding research.
Retrieval-augmented generation (RAG) is a technique in which a large language model (LLM) generates answers based on relevant documents retrieved from an external document collection. Existing RAG evaluation benchmarks often use public data, such as Wikipedia and news articles, as the external document collection. However, such data are very likely already included in the LLM’s pre-training corpus, which may prevent an accurate evaluation of the model’s ability to generate answers based on the retrieved documents. In this study, we construct a Japanese RAG benchmark by having an LLM synthesize documents about non-existent entities and events and using this collection of synthetic documents as the search target. Since these synthetic documents are not included in the LLM’s training data, the ability to generate answers based on retrieved documents can be evaluated more accurately. In addition to the synthetic documents, the benchmark comprises questions and correct answers, which are created using a combination of LLMs and human effort. We then evaluate and analyze the RAG performance of existing LLMs using the constructed benchmark.

2023

Pretrained language models require consistent segmentation (e.g., subword- or character-level segmentation) between pretraining and finetuning. In NLP, many tasks are modeled better with subword-level segmentation than with character-level segmentation. However, because of their format, several tasks require character-level segmentation. Thus, to tackle both types of NLP tasks, language models must be pretrained independently for subword- and character-level segmentation, which is inefficient and costly. Instead, this paper proposes a method for training a language model with unified segmentation, so that the trained model can be finetuned with either subword- or character-level segmentation. The principle of the method is to apply the subword regularization technique to generate a mixture of subword- and character-level segmentation. Through experiments on BERT models, we demonstrate that our method can halve the computational cost of pretraining.
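As a rough illustration of what a mixture of subword- and character-level segmentation could look like, the following sketch segments a word with a greedy longest-match tokenizer and then splits each piece into characters with some probability. This is not the paper's procedure: the greedy tokenizer stands in for a real subword model, the per-piece coin flip stands in for subword regularization, and the names `greedy_subwords`, `mixed_segmentation`, and `char_prob` as well as the toy vocabulary are all hypothetical.

```python
import random


def greedy_subwords(word, vocab):
    """Longest-match-first segmentation (a stand-in for a real subword model).

    Substrings not found in `vocab` fall back to single characters.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def mixed_segmentation(word, vocab, char_prob, rng):
    """Split each subword piece into characters with probability `char_prob`.

    char_prob = 0.0 gives pure subword segmentation, char_prob = 1.0 gives
    pure character segmentation, and values in between give a mixture.
    """
    out = []
    for piece in greedy_subwords(word, vocab):
        if rng.random() < char_prob:
            out.extend(piece)  # character-level pieces
        else:
            out.append(piece)  # keep the subword piece
    return out


# Toy vocabulary, assumed for illustration only.
TOY_VOCAB = {"un", "ified", "seg", "ment", "ation"}
```

Calling `mixed_segmentation("unified", TOY_VOCAB, 0.5, random.Random(0))` repeatedly with different seeds yields different segmentations of the same word, so a model trained on such samples sees both granularities.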

2022

Dialogue systems that respond inconsistently are unappealing. In this study, we build a dialogue system that responds according to a given character setting (persona) to bring consistency. Given the rapidly increasing scale of language models, we propose an approach that applies prompt-tuning, which has a low training cost, to large-scale pre-trained language models. The results of automatic and manual evaluations in English and Japanese show that it is possible to build a dialogue system that gives more natural and personalized responses with fewer computational resources than fine-tuning.