Hideyuki Tachibana


2026

Since texts generated by large language models (LLMs) may contain misinformation (hallucinations), developing fact-checking systems capable of assessing their veracity has become increasingly important. One of the mainstream approaches to fact-checking is the claim-based one, which first decomposes a generated text into claims, i.e., independent and atomic units of information. Each claim is then used as a query to retrieve supporting evidence, and a verdict is predicted for each claim-evidence pair. Conducting fact-checking at the claim level enhances the explainability of verification results. However, achieving highly accurate verification requires that the text be decomposed into claims at an appropriate level of granularity. To address this, we constructed a dataset for Japanese claim decomposition. As part of this dataset construction, we designed detailed guidelines for claim decomposition, ensuring that the extracted claims are in a form useful for fact-checking and that the decomposition rules mitigate annotator variability. Quantitative evaluation confirmed that the constructed dataset is of high quality. Additionally, experiments on prompt-based claim decomposition using the constructed dataset demonstrated that adding high-quality few-shot examples and guidelines to prompts improved performance.
We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior resources such as NeuBAROCO and JFLD, which emphasize general or belief-aligned logic, BIS Reasoning 1.0 systematically introduces logically valid yet belief-inconsistent syllogisms to expose belief bias—the tendency to accept believable conclusions irrespective of validity. We benchmark a representative suite of cutting-edge models—including OpenAI GPT-5 variants, GPT-4o, Qwen, and prominent Japanese LLMs—under a uniform, zero-shot protocol. Reasoning-centric models achieve near-perfect accuracy on BIS Reasoning 1.0 (e.g., Qwen3-32B ≈99% and GPT-5-mini up to ≈99.7%), while GPT-4o attains around 80%. Earlier Japanese-specialized models underperform, often well below 60%, whereas the latest llm-jp-3.1-13b-instruct4 markedly improves to the mid-80% range. These results indicate that robustness to belief-inconsistent inputs is driven more by explicit reasoning optimization than by language specialization or scale alone. Our analysis further shows that even top-tier systems falter when logical validity conflicts with intuitive or factual beliefs, and that performance is sensitive to prompt design and inference-time reasoning effort. We discuss implications for safety-critical domains—law, healthcare, and scientific literature—where strict logical fidelity must override intuitive belief to ensure reliability.

2024

Prior work on multilingual sentence embedding has demonstrated that the efficient use of natural language inference (NLI) data to build high-performance models can outperform conventional methods. However, the potential benefits from the recent “exponential” growth of language models with billions of parameters have not yet been fully explored. In this paper, we introduce Multilingual Sentence T5 (m-ST5), a larger NLI-based multilingual sentence embedding model built by extending Sentence T5, an existing monolingual model. By employing the low-rank adaptation (LoRA) technique, we successfully scaled the model to 5.7 billion parameters. We conducted experiments to evaluate sentence embedding performance and verified that the method outperforms the prior NLI-based approach. Furthermore, we confirmed a positive correlation between model size and performance. Notably, languages with fewer resources or with less linguistic similarity to English benefited more from the parameter increase. Our model is available at https://huggingface.co/pkshatech/m-ST5.