Runhai Jiao


2026

Reinforcement learning, with its interpretable path reasoning, has emerged as a promising paradigm for multi-hop question answering over knowledge graphs. However, existing approaches suffer from two inherent limitations: (1) lacking effective intermediate guidance, agents often fall into aimless exploration when confronted with complex multi-hop questions; and (2) policy networks focus on local neighborhood information, making it difficult to anticipate the long-term consequences of decisions. To address these challenges, we propose a Progressive Planning and Reinforced Reasoning (PPRR) framework. Specifically, we introduce large language models as multi-hop reasoning planners that convert decomposed sub-question sequences into stepwise decision guidance, giving the agent human-like, step-by-step problem-solving capabilities. In addition, we design a structure-aware lookahead policy network that explicitly models inter-node dependencies along the multi-hop reasoning process and performs lookahead value evaluations for candidate actions, enhancing the agent’s global state awareness and decision foresight in complex environments. Finally, we conduct extensive experiments on four public multi-hop question answering benchmarks and one domain-specific dataset. The results demonstrate that our framework surpasses state-of-the-art methods and shows strong generalization.

2025

Multimodal documents, which are among the most prevalent data formats, combine a large amount of textual and visual content. Extracting structured triple knowledge from these documents is a highly valuable task that helps users efficiently acquire key entities and their relationships. However, existing methods are limited in their ability to simultaneously process long textual content and multiple associated images for triple extraction. Therefore, we propose a Multimodal Document-level Triple Extraction (MDocTE) framework. Specifically, we introduce a dynamic document graph construction method that extends the model’s scope to the entire document and the external world, while adaptively optimizing the graph structure. Next, we inject the global information and external knowledge learned by the graph neural network into the large language model, which generates structured triples after deep interaction. Finally, we design a multimodal relation-aware mechanism and loss function to guide the model in reflecting on the information shared between text and visuals. We release a new triple extraction dataset for multimodal documents and conduct extensive experiments. The results demonstrate that the proposed framework outperforms state-of-the-art baselines, filling a gap in multimodal document-level triple extraction. Our data is available at https://github.com/XiangLiphd/Triple-extraction-dataset-for-multimodal-documents.