Akihiko Fukuchi


2026

We present JMTEB, a large-scale evaluation suite for Japanese text embedding models, designed to provide comprehensive coverage across multiple task types. The benchmark integrates 28 datasets across 5 tasks, enabling broad and challenging evaluation of model performance in diverse scenarios. While the full benchmark delivers thorough assessment, its scale poses practical challenges in terms of computation time and resource requirements. To address this, we construct JMTEB-lite, a lightweight version of JMTEB, by substantially reducing corpus size in retrieval-related tasks. JMTEB-lite significantly accelerates evaluation while maintaining high fidelity to the full benchmark. Together, JMTEB and JMTEB-lite form a flexible evaluation framework: the full version serves as a comprehensive standard for exhaustive benchmarking, while the lightweight version enables rapid iteration and efficient model selection. This dual approach facilitates both rigorous evaluation and practical development workflows, supporting the advancement of Japanese text embedding research.
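The corpus reduction behind a lightweight retrieval benchmark can be illustrated with a minimal sketch. The abstract does not say how JMTEB-lite selects documents, so everything here is an assumption made for illustration: the hypothetical `reduce_corpus` helper, the data layout (`corpus` as a mapping from document ID to text, `qrels` as a mapping from query ID to relevant document IDs), and the choice of uniform random sampling for distractors. The sketch keeps every document that is relevant to some query, so no query loses its gold answers, and fills the remainder with sampled distractors.

```python
import random


def reduce_corpus(corpus, qrels, num_distractors, seed=0):
    """Shrink a retrieval corpus while keeping every relevant document.

    corpus: dict mapping doc_id -> text (assumed layout)
    qrels: dict mapping query_id -> set of relevant doc_ids (assumed layout)
    num_distractors: number of non-relevant documents to keep
    """
    # Collect every document that is a gold answer for at least one query.
    relevant = set()
    for doc_ids in qrels.values():
        relevant.update(doc_ids)

    # Sample distractors from the remaining documents. Uniform sampling is an
    # assumption of this sketch, not the selection rule used by JMTEB-lite.
    candidates = sorted(doc_id for doc_id in corpus if doc_id not in relevant)
    rng = random.Random(seed)
    sampled = rng.sample(candidates, min(num_distractors, len(candidates)))

    keep = relevant.intersection(corpus) | set(sampled)
    return {doc_id: text for doc_id, text in corpus.items() if doc_id in keep}


# Toy data: 100 documents, two queries with three distinct relevant documents.
corpus = {f"d{i}": f"document {i}" for i in range(100)}
qrels = {"q1": {"d3", "d7"}, "q2": {"d42"}}
lite = reduce_corpus(corpus, qrels, num_distractors=10)
```

In this toy setup the reduced corpus holds the three relevant documents plus ten distractors. Uniformly sampled distractors tend to make retrieval easier than the full corpus does, so a reduction that aims to stay faithful to the full benchmark would likely need a more careful selection rule.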
Retrieval-augmented generation (RAG) is a technique in which a large language model (LLM) generates answers based on relevant documents retrieved from an external document collection. Existing RAG evaluation benchmarks often use public data, such as Wikipedia and news articles, as the external document collection. However, such data is very likely already included in the LLM’s pre-training corpus, which may prevent an accurate evaluation of the model’s ability to generate answers based on the retrieved documents. In this study, we construct a Japanese RAG benchmark by having an LLM synthesize documents about non-existent entities and events, and we use this collection of synthetic documents as the search target. Since these synthetic documents are not included in the LLM’s training data, the ability to generate answers based on retrieved documents can be evaluated more accurately. In addition to the synthetic documents, the benchmark contains questions and correct answers, which are created using a combination of LLMs and human effort. We then evaluate and analyze the RAG performance of existing LLMs using the constructed benchmark.

2024

Prior work on multilingual sentence embedding has demonstrated that models built through efficient use of natural language inference (NLI) data can outperform conventional methods. However, the potential benefits of the recent “exponential” growth of language models to billions of parameters have not yet been fully explored. In this paper, we introduce Multilingual Sentence T5 (m-ST5), a larger NLI-based multilingual sentence embedding model that extends Sentence T5, an existing monolingual model. By employing the low-rank adaptation (LoRA) technique, we successfully scale the model to 5.7 billion parameters. We conduct experiments to evaluate sentence embedding performance and verify that the method outperforms the prior NLI-based approach. Furthermore, we confirm a positive correlation between model size and performance. Notably, languages with fewer resources or with less linguistic similarity to English benefit more from the parameter increase. Our model is available at https://huggingface.co/pkshatech/m-ST5.
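The low-rank adaptation technique mentioned above can be sketched in a few lines. This is a generic illustration of LoRA, not the m-ST5 training code: the dimensions, the `alpha` scaling value, and the helper names `matvec`, `lora_forward`, and `trainable_params` are all hypothetical. The frozen weight `W` is left untouched, and only the two small matrices `A` and `B` would be trained, so the adapted layer computes `W x + (alpha / rank) * B (A x)`.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x).

    W (d_out x d_in) is frozen; A (rank x d_in) and B (d_out x rank)
    are the only trainable parameters.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_out, d_in, rank):
    """Number of parameters in A and B combined."""
    return rank * (d_in + d_out)


# Toy dimensions chosen for illustration only.
d_out, d_in, rank, alpha = 6, 8, 2, 4
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
# B starts at zero, so the adapted layer initially matches the frozen layer.
B = [[0.0] * rank for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

full_count = d_out * d_in
lora_count = trainable_params(d_out, d_in, rank)
```

With these toy dimensions the full weight matrix has 48 parameters while the adapter has 28. The saving grows quickly with layer width, since the adapter scales with `rank * (d_in + d_out)` rather than `d_in * d_out`, which is what makes fine-tuning a multi-billion-parameter model tractable.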