Lingyun Sun


2026

Multimodal Continual Instruction Tuning (MCIT) is essential for adapting Multimodal Large Language Models (MLLMs) to dynamic data streams, yet preventing catastrophic forgetting remains a major challenge. Existing parameter-efficient approaches often face a dilemma: fixed architectures suffer from knowledge interference, while dynamic strategies incur inefficient capacity expansion, limiting scalability. We propose MoBLoRA (Mixture-of-Bases LoRA), a novel framework for MCIT. Motivated by our geometric analysis revealing subspace redundancy across sequential tasks, MoBLoRA shifts the paradigm from expert selection to subspace mixing: it decomposes adaptation weights into a globally shared pool of orthonormal bases to capture task-invariant knowledge, and lightweight mixing matrices to encode task-specific variations. This design effectively decouples knowledge accumulation from task reconstruction. Experiments on standard benchmarks show MoBLoRA significantly outperforms state-of-the-art methods while maintaining superior parameter efficiency.
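
The shared-bases-plus-mixing idea in the MoBLoRA abstract can be illustrated with a small sketch. The abstract does not give the exact factorization, so everything here is an assumption made for illustration: the two-sided form ΔW_t = Uᵀ M_t V, the names `U`, `V`, `M_task1`, and `task_update`, the toy dimensions, and the use of Gram-Schmidt to obtain orthonormal bases. It is not the authors' implementation.

```python
import random

def orthonormalize(vectors):
    """Modified Gram-Schmidt: return orthonormal vectors spanning the inputs."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - proj * bi for wi, bi in zip(w, b)]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm > 1e-10:  # skip (near-)dependent vectors
            basis.append([wi / norm for wi in w])
    return basis

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_out, d_in, k = 6, 5, 3  # toy sizes, chosen only for the example

# Shared pool (assumed form): k orthonormal basis vectors on each side,
# stored as rows. These would be reused by every task.
U = orthonormalize(random_matrix(rng, k, d_out))
V = orthonormalize(random_matrix(rng, k, d_in))

# Task-specific part (assumed form): one small k x k mixing matrix per task.
M_task1 = random_matrix(rng, k, k)

def task_update(U, M, V):
    """Reconstruct a d_out x d_in weight update as U^T M V."""
    return matmul(matmul(transpose(U), M), V)

delta_w = task_update(U, M_task1, V)

# Because the bases are orthonormal, projecting the update back onto them
# recovers the mixing matrix: U (U^T M V) V^T = M.
recovered = matmul(matmul(U, delta_w), transpose(V))

shared_params = k * (d_out + d_in)  # stored once for all tasks
per_task_params = k * k             # added for each new task
```

Under these assumptions, each new task adds only `k * k` parameters, while the bases are stored once. This is one way to read the abstract's claim that knowledge accumulation is decoupled from task reconstruction.
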
Large Language Models (LLMs) exhibit impressive linguistic fluency, yet it remains unclear whether they possess human-like Theory of Mind (ToM) or merely rely on statistical heuristics, particularly in complex social tasks such as irony comprehension. To address the limitations of existing binary benchmarks, this study establishes a multi-dimensional evaluation framework comprising 140 carefully designed probes. These probes are derived from 10 story prototypes based on established cognitive theories. The framework systematically modulates contextual contrast, linguistic cues, and cognitive mechanisms. By comparing the performance of ten state-of-the-art LLMs against 300 human participants, this study uncovers a significant dichotomy in performance. Although LLMs demonstrate superior sensitivity in subsidiary pragmatic inferences, human participants outperform them in holistic irony judgment. Crucially, the results reveal a systematic "intent-irony decoupling", wherein LLMs fail to integrate pragmatic signals into their final judgments. These models exhibit aggressive decision biases and rely on "context-utterance conflict" heuristics. These findings suggest that current LLMs simulate irony comprehension without the underlying cognitive mechanisms. The development of future artificial intelligence may require the integration of explicit ToM modules to bridge the gap between surface-level pattern matching and genuine social understanding.
Large Language Models are increasingly utilized as Role-Playing Agents (RPAs) to simulate personas in interactive settings. However, current RPAs often produce flattened and stereotypical personas with limited depth and fidelity. This limitation arises from two core challenges: insufficient modeling of complex personal histories and internal logic, and ungrounded reasoning that fails to preserve persona coherence as dialogue context evolves. To address these challenges, we propose ThinkPersona, a role-playing agent trained to explicitly ground responses in individual identity. We introduce Persona Graphs as structured representations that encode life trajectories, values, relationships, and events as interconnected knowledge. We construct 1,201 Persona Graphs from real-world interviews and derive a Question–Reasoning–Answer (QRA) dataset of 23,401 samples that supervises reasoning over persona evidence. Fine-tuning on QRA enables ThinkPersona to internalize persona logic and generate persona-consistent responses in long-context dialogues. Experiments on three benchmarks show that ThinkPersona improves role-playing fidelity, behavioral consistency, and grounded reasoning over existing methods, while preserving general instruction-following capabilities. Our code and dataset are available at https://github.com/Hualeez/ThinkPersona.
Current conversational agents often follow static learning paradigms and miss the implicit, evolving feedback embedded in users’ follow-up behaviors. We propose IEvoAgent, an evolving conversational agent framework that leverages the structured dependency between agent responses and user reactions. We construct an annotated dataset from LMSYS-Chat-1M and WildChat and find consistent response-conditioned feedback patterns. Based on this finding, IEvoAgent uses a conditional feedback distribution matrix to estimate expected feedback rewards, and it combines offline Kahneman-Tversky Optimization (KTO) alignment with an inference-time prompt-evolution mechanism driven by a dynamic matrix. Experiments on MT-Bench-101, WildBench, and FB-Bench show improvements over open-source baselines, indicating that mining implicit feedback supports better multi-turn alignment under evolving user preferences. Our code and dataset are available at https://github.com/Hualeez/IEvoAgent.
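
As a rough illustration of how a conditional feedback distribution matrix could yield an expected feedback reward, the sketch below estimates P(feedback | response type) from counted pairs and takes a reward-weighted sum over each row. The response types, feedback labels, reward values, and function names are all hypothetical; the abstract does not specify them, and this is not the IEvoAgent implementation.

```python
from collections import Counter, defaultdict

# Hypothetical reward assigned to each kind of user follow-up.
FEEDBACK_REWARD = {"accept": 1.0, "clarify": -0.5, "reject": -1.0}

def feedback_matrix(pairs):
    """Estimate P(feedback | response_type) from (response_type, feedback) pairs."""
    counts = defaultdict(Counter)
    for response_type, feedback in pairs:
        counts[response_type][feedback] += 1
    matrix = {}
    for response_type, row in counts.items():
        total = sum(row.values())
        # Counter returns 0 for feedback labels never observed in this row.
        matrix[response_type] = {f: row[f] / total for f in FEEDBACK_REWARD}
    return matrix

def expected_reward(matrix, response_type):
    """Reward-weighted sum over one row of the conditional matrix."""
    return sum(p * FEEDBACK_REWARD[f] for f, p in matrix[response_type].items())

# Toy annotated log (hypothetical): which follow-up each response style drew.
pairs = (
    [("concise", "accept")] * 3
    + [("concise", "clarify")]
    + [("verbose", "accept")]
    + [("verbose", "clarify")]
    + [("verbose", "reject")] * 2
)

matrix = feedback_matrix(pairs)
concise_score = expected_reward(matrix, "concise")  # 0.75*1.0 + 0.25*(-0.5)
verbose_score = expected_reward(matrix, "verbose")  # 0.25*1.0 + 0.25*(-0.5) + 0.5*(-1.0)
```

In this toy log the concise style scores 0.625 and the verbose style scores -0.375, so a policy guided by such a matrix would favor the concise style. Re-estimating the matrix as new follow-ups arrive is one plausible way for it to become dynamic at inference time, though the abstract does not describe the update rule.
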