Yizhuo Li


2026

Large vision-language models (LVLMs) have made substantial advances on Olympiad-level reasoning tasks. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.

2025

Temporal Knowledge Graphs (TKGs) are vital for event prediction, yet current methods face limitations. Graph neural networks depend mainly on structural information, often overlooking semantic understanding and incurring high computational costs. Meanwhile, Large Language Models (LLMs) support zero-shot reasoning but struggle to capture the patterns by which historical events develop. To tackle these challenges, we introduce a training-free Analogical Replay (AnRe) reasoning framework. Our approach retrieves similar events for queries through semantic-driven clustering and builds comprehensive historical contexts using a dual history extraction module that integrates long-term and short-term history. It then uses LLMs to generate analogical reasoning examples as contextual inputs, enabling the model to understand the historical patterns of similar events and improve its ability to predict unknown ones. Our experiments on four benchmarks show that AnRe significantly outperforms traditional training-based methods and existing LLM-based methods. Further ablation studies confirm the effectiveness of the dual history extraction and analogical replay mechanisms.
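
To make the idea of combining long-term and short-term history concrete, the following is a minimal sketch, not the AnRe implementation. The abstract does not define either kind of history, so the sketch assumes that long-term history means earlier events sharing the query's subject and relation, and that short-term history means any events on the query's subject within a recent time window. The names `Event`, `dual_history`, `format_context`, `short_window`, and `long_limit` are all hypothetical, and the clustering and analogical-example generation steps are omitted.

```python
from typing import List, NamedTuple, Tuple


class Event(NamedTuple):
    """One temporal knowledge graph fact: (subject, relation, object, time)."""
    subject: str
    relation: str
    obj: str
    time: int


def dual_history(
    events: List[Event],
    subject: str,
    relation: str,
    query_time: int,
    short_window: int = 3,
    long_limit: int = 5,
) -> Tuple[List[Event], List[Event]]:
    """Return (long_term, short_term) histories for the query (subject, relation, ?, query_time).

    Assumed definitions (not taken from the paper):
    - long_term: the latest `long_limit` past events with the same subject and relation
    - short_term: all past events on the subject within `short_window` time steps
    """
    # Only events strictly before the query time count as history.
    past = sorted(
        (e for e in events if e.subject == subject and e.time < query_time),
        key=lambda e: e.time,
    )
    long_term = [e for e in past if e.relation == relation][-long_limit:]
    short_term = [e for e in past if query_time - e.time <= short_window]
    return long_term, short_term


def format_context(
    long_term: List[Event],
    short_term: List[Event],
    subject: str,
    relation: str,
    query_time: int,
) -> str:
    """Render both histories and the query as plain text for use as LLM context."""
    lines = ["Long-term history:"]
    lines += [f"  t={e.time}: ({e.subject}, {e.relation}, {e.obj})" for e in long_term]
    lines.append("Short-term history:")
    lines += [f"  t={e.time}: ({e.subject}, {e.relation}, {e.obj})" for e in short_term]
    lines.append(f"Query: ({subject}, {relation}, ?) at t={query_time}")
    return "\n".join(lines)


if __name__ == "__main__":
    toy_events = [
        Event("A", "visit", "B", 1),
        Event("A", "visit", "C", 5),
        Event("A", "criticize", "D", 8),
        Event("A", "visit", "E", 9),
    ]
    long_term, short_term = dual_history(toy_events, "A", "visit", 10, short_window=2)
    print(format_context(long_term, short_term, "A", "visit", 10))
```

Under these assumptions, the long-term list captures recurring behavior for the same relation, while the short-term list captures recent activity of any kind, so the two views can disagree and both are passed to the model.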