Masahiro Hamasaki


2026

Large Language Models (LLMs) provide flexible natural language processing capabilities, while knowledge graphs (KGs) offer explicit, structured knowledge. Integrating the two in a complementary manner enables the development of reliable and verifiable AI systems. In particular, knowledge graph question answering (KGQA) has attracted attention as a means of reducing LLM hallucinations and leveraging knowledge beyond the training data. However, existing KGQA benchmark datasets are biased toward encyclopedic knowledge, limited to a single modality, and lacking in fine-grained spatiotemporal data, which restricts their applicability to the real-world scenarios targeted by Embodied AI. We introduce HOME-KGQA, a novel KGQA benchmark dataset built on a multimodal KG of daily household activities. HOME-KGQA consists of complex, multi-hop natural language questions paired with corresponding graph database queries. Compared to existing benchmarks, it includes more challenging questions that involve multi-level spatiotemporal reasoning, multimodal grounding, and aggregate functions. Experimental results show that LLM-based KGQA methods fail to achieve performance on HOME-KGQA comparable to their performance on existing datasets. This highlights significant challenges that must be addressed before KGQA systems can be deployed in the real world. Our dataset is available at https://github.com/aistairc/home-kgqa.
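To make the question-query pairing concrete, the following is a minimal Python sketch of what one benchmark item might look like. The question text, node labels, relationship names, and the Cypher query are hypothetical illustrations of the kind of multi-hop, spatiotemporal, aggregate-bearing item the abstract describes, not actual HOME-KGQA data or its schema.

```python
# Hypothetical HOME-KGQA-style item: a natural-language question paired
# with an executable graph database query. The schema (:Person, :Object,
# :Room, MOVED, LOCATED_IN) is an assumption for illustration only.
example_item = {
    "question": (
        "Which objects did the resident move into the kitchen "
        "between 08:00 and 09:00, and how many were there?"
    ),
    # Multi-hop pattern match with a temporal filter and an aggregate,
    # mirroring the question types the abstract highlights.
    "cypher": """
        MATCH (p:Person {role: 'resident'})-[m:MOVED]->(o:Object),
              (o)-[:LOCATED_IN]->(r:Room {name: 'kitchen'})
        WHERE m.time >= '08:00' AND m.time < '09:00'
        RETURN collect(o.name) AS objects, count(o) AS n_objects
    """,
}

print(example_item["question"])
```

Pairing each question with an executable query lets a benchmark score systems on whether the generated query returns the correct result, rather than on surface-level string matching.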
Quantifying uncertainty in large language models (LLMs) is crucial for safety-critical applications, as it helps identify factually incorrect LLM answers, commonly referred to as hallucinations. Recent advances quantify uncertainty by incorporating the semantics of sampled answers when estimating entropy. These methods typically rely on normalized probabilities computed from a limited number of sampled answers. However, we note that these estimators fail to account for semantics that could be produced as answers but are not observed in the sample. This is a significant oversight, since a heavier tail of unobserved answer probabilities indicates a higher level of overall uncertainty. To alleviate this issue, we propose Evidential Semantic Entropy (EVSE), which leverages evidence theory to represent both the total ignorance arising from unobserved answers and the partial ignorance stemming from semantic relationships among the observed answers. Experiments show that EVSE significantly improves uncertainty quantification performance. Our code is available at: https://github.com/lucieK-J/EvidentialSemanticEntropy.git.
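For context, here is a minimal sketch of the sample-based semantic entropy estimator the abstract critiques, assuming the sampled answers have already been grouped into semantic-equivalence clusters (the clustering step, which typically uses a bidirectional entailment check, is omitted, and the function name is illustrative). Note that the entropy is computed only over the clusters observed in the sample, which is exactly where probability mass on unobserved answers is ignored.

```python
import math
from collections import Counter

def semantic_entropy(cluster_labels):
    """Estimate semantic entropy from sampled LLM answers that have
    already been assigned to semantic-equivalence clusters.

    cluster_labels: one cluster id per sampled answer.
    """
    counts = Counter(cluster_labels)
    n = len(cluster_labels)
    # Normalized probability of each observed semantic cluster. Any
    # probability mass on answers never seen in the finite sample is
    # implicitly set to zero -- the oversight the abstract points out.
    probs = [c / n for c in counts.values()]
    return -sum(p * math.log(p) for p in probs)

# Ten samples falling into three semantic clusters:
print(semantic_entropy(["a", "a", "a", "b", "b", "b", "b", "c", "c", "c"]))
```

Because the estimate is normalized over observed clusters only, two questions with very different amounts of unseen-answer mass can receive the same entropy, which is the gap EVSE's evidence-theoretic treatment is designed to address.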