Yi Yuan


2026

Recent advances in Multimodal Entity Linking (MEL) exploit textual and visual information to disambiguate mentions and align them with entities in a knowledge base. Existing methods typically design separate and complex network modules for each type of interaction among multi-granular and multimodal features, while lacking explicit modeling of the joint dependencies among these features. Moreover, most approaches rely on unidirectional retrieval-based matching and lack knowledge-driven verification, leading to unreliable disambiguation in weak-context scenarios. To address these challenges, we propose a novel two-stage MEL framework termed ThinkLinker. First, we introduce a low-rank fusion mechanism to model the joint dependencies among multi-granular and multimodal features, enabling comprehensive and explicit interactions while learning task-relevant discriminative information for candidate ranking in a lower-dimensional space. Subsequently, we develop a bidirectional retrieval-verification paradigm, where the ranked candidate entities guide an LLM-based multi-turn, dialogue-style verification process to generate mention-specific contextual augmentation. The augmented context is then adaptively fused with the original representation to further refine the linking model. Experimental results on public benchmark datasets demonstrate that the proposed ThinkLinker outperforms all state-of-the-art baselines. The code is publicly available at https://github.com/zhouyuanyu/ThinkLinker.
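
The abstract above does not spell out the fusion mechanism, so the following is only a minimal sketch of one common form of low-rank bilinear fusion, not ThinkLinker's actual module. It assumes each modality is a single flat vector, and the names `low_rank_fuse`, `text_factors`, and `image_factors` are hypothetical. Each rank contributes an elementwise product of per-modality projections, which approximates a full text-image interaction tensor without ever materializing it.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: Matrix[i] is row i


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def low_rank_fuse(
    text: Vector,
    image: Vector,
    text_factors: List[Matrix],   # hypothetical: one (out_dim x (len(text)+1)) matrix per rank
    image_factors: List[Matrix],  # hypothetical: one (out_dim x (len(image)+1)) matrix per rank
) -> Vector:
    """Sum over ranks of elementwise products of per-modality projections."""
    # Appending a constant 1 lets unimodal terms survive the product.
    t = text + [1.0]
    v = image + [1.0]
    out_dim = len(text_factors[0])
    fused = [0.0] * out_dim
    for wt, wv in zip(text_factors, image_factors):
        pt, pv = matvec(wt, t), matvec(wv, v)
        for k in range(out_dim):
            fused[k] += pt[k] * pv[k]
    return fused


# Toy usage: 1-d text and image features, rank 2, 1-d output.
fused = low_rank_fuse(
    [2.0], [3.0],
    text_factors=[[[1.0, 0.0]], [[0.0, 1.0]]],
    image_factors=[[[1.0, 1.0]], [[2.0, 0.0]]],
)
```

The cost grows linearly with the rank and the number of modalities, which is the usual motivation for a low-rank factorization over a full outer-product tensor.
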
Agentic reinforcement learning enables large language models to solve long-horizon tasks by interacting with the environment and internalizing tool-use behavior into their reasoning. Prior work assigns supervision primarily based on outcome rewards or external reward models, but largely ignores environment observations, a critical source of learning. Consequently, agents may identify successful actions without understanding how the environment responds, producing suboptimal policies. To address this, we propose SOAR (Supervision from Observation for Agentic Reinforcement Learning), which assigns positive advantages to observation tokens proportional to the negative entropy of preceding actions. This encourages the agent to learn from outcomes of confident actions, grounding policy updates in environment dynamics and improving anticipation of tool-call consequences. Empirical results across three domains and 14 benchmarks show that SOAR improves performance, yielding gains of up to 7.0% on general reasoning tasks and 16.9% on deep research tasks, while reducing erroneous and inefficient tool usage.
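
The idea of tying observation-token advantages to the confidence of the preceding action can be illustrated with a small sketch. The abstract does not give the exact formula, so the normalization below (one minus mean entropy divided by the maximum entropy) and the names `observation_advantage` and `scale` are assumptions chosen only to keep the advantage positive and higher for lower-entropy actions.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def observation_advantage(
    action_dists: List[Sequence[float]], scale: float = 1.0
) -> float:
    """Assumed mapping: low mean action entropy -> larger positive advantage.

    Returns a value in [0, scale] to be shared by the observation tokens
    that follow the action. This is an illustrative choice, not SOAR's
    published formula.
    """
    if not action_dists:
        return 0.0
    max_entropy = math.log(len(action_dists[0]))
    if max_entropy == 0.0:  # single-token vocabulary: always fully confident
        return scale
    mean_entropy = sum(token_entropy(d) for d in action_dists) / len(action_dists)
    confidence = 1.0 - mean_entropy / max_entropy
    return scale * max(confidence, 0.0)


# A confident action earns its observation tokens more credit than an uncertain one.
confident = observation_advantage([[0.97, 0.01, 0.01, 0.01]])
uncertain = observation_advantage([[0.25, 0.25, 0.25, 0.25]])
```

Under this assumed scheme, observations that follow a near-deterministic tool call receive close to the full `scale`, while observations that follow a near-uniform action distribution receive almost nothing.
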

2025

Direct Preference Optimization (DPO) has recently emerged as an efficient and effective method for aligning large language models with human preferences. However, constructing high-quality preference datasets remains challenging, often requiring expensive annotation by humans or by powerful LMs. Additionally, standard DPO performs suboptimally on complex reasoning tasks, such as mathematical and code reasoning. In this paper, we introduce an approach that collects preference pairs through iterative sampling and execution feedback, tailored to the current learning state (e.g., well-learned, mis-learned, and unlearned) of the policy model. To alleviate the failures of DPO and improve its applicability to reasoning tasks, we propose an iterative uncertainty-aware preference optimization method that achieves fine-grained preference control by assessing model confidence. We validate our approach across three reasoning tasks, using five established reasoning datasets and one self-curated dataset. Our experimental results demonstrate an overall improvement of 3.6% over the standard DPO method and show that the model exhibits promising generalizability.
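
As a rough illustration of the two ingredients named above, the sketch below pairs sampled responses by execution feedback and evaluates the standard DPO loss on one pair. It shows plain DPO only; the uncertainty-aware weighting is not specified in the abstract and is not reproduced here. The helper names `build_pairs` and `dpo_loss`, and the pass/fail boolean standing in for execution feedback, are assumptions.

```python
import math
from itertools import product
from typing import List, Tuple


def build_pairs(samples: List[Tuple[str, bool]]) -> List[Tuple[str, str]]:
    """Pair each sample that passed execution (chosen) with each that failed (rejected)."""
    passed = [text for text, ok in samples if ok]
    failed = [text for text, ok in samples if not ok]
    return list(product(passed, failed))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow for large |m|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy usage: three sampled solutions, one of which passed its tests.
pairs = build_pairs([("sol_a", True), ("sol_b", False), ("sol_c", False)])
loss = dpo_loss(logp_chosen=-1.0, logp_rejected=-3.0, ref_chosen=-2.0, ref_rejected=-2.0)
```

When the policy and reference agree on both responses, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen response's likelihood relative to the reference.
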

2007