Bowen Xing


2026

Cross-lingual chain-of-thought (XCoT) with self-consistency markedly enhances multilingual reasoning, yet existing methods remain costly because they sample full trajectories extensively across languages. Moreover, multilingual LLM representations vary strongly by language, hindering direct feature comparisons and effective pruning. To address this, we introduce UL-XCoT, the first efficient unified logic cross-lingual reasoning framework that minimizes redundancy in token usage and latency, yielding the greatest efficiency under limited sampling budgets during inference. Specifically, UL-XCoT (1) uses fewer languages by selecting, per query, a small candidate language set in a language-invariant unified logic space, (2) uses fewer tokens by monitoring logic-space trajectory dynamics during decoding to prune low-quality reasoning paths, and (3) aggregates the remaining high-quality trajectories via voting. Experiments on PolyMath across 18 languages with DeepSeek-R1-Distill-Qwen-7B demonstrate that UL-XCoT achieves competitive accuracy while cutting decoding token cost and latency by over 50% relative to prior sampling baselines. It also delivers more stable gains on low-resource languages, underscoring its robustness where the standard XCoT self-consistency method fails.
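The prune-then-vote aggregation in steps (2) and (3) above can be illustrated with a minimal Python sketch. It is not the UL-XCoT implementation: the per-trajectory `quality_score`, the `min_score` threshold, and the function name are assumptions made for illustration, and the abstract does not specify how logic-space dynamics are turned into a pruning signal.

```python
from collections import Counter


def vote_after_pruning(trajectories, min_score=0.5):
    """Majority vote over the answers of trajectories that survive pruning.

    trajectories: list of (answer, quality_score) pairs. The score is a
    hypothetical stand-in for whatever signal is used to judge a path.
    """
    if not trajectories:
        return None
    # Keep only the answers from paths judged to be of sufficient quality.
    kept = [answer for answer, score in trajectories if score >= min_score]
    # Assumption: if everything was pruned, fall back to plain voting.
    if not kept:
        kept = [answer for answer, _ in trajectories]
    return Counter(kept).most_common(1)[0][0]


# Toy data: three low-scoring paths agree on "17", two high-scoring on "42".
sampled = [("42", 0.9), ("42", 0.7), ("17", 0.2),
           ("17", 0.3), ("17", 0.1), ("40", 0.8)]
print(vote_after_pruning(sampled))                 # vote over surviving paths
print(vote_after_pruning(sampled, min_score=0.0))  # plain self-consistency
```

In this toy case, plain self-consistency follows the three low-quality paths, whereas pruning first lets the two higher-scoring paths decide the vote.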
Medical Large Vision-Language Models (Med-LVLMs) have shown strong potential in multimodal clinical applications such as medical visual question answering and report generation. However, Med-LVLMs remain challenged by hallucinations caused by modality misalignment, where models prioritize textual knowledge over visual evidence and generate outputs that conflict with medical images. To mitigate this issue, recent studies have explored preference optimization to improve image–text alignment, achieving promising results. Despite these advances, existing preference-based methods still face two limitations in medical settings: (1) overfitting to superficial cues, and (2) pseudo convergence of the preference signal. In this paper, we propose Dynamic Evidence-Guided Preference Optimization (DEPO), a new framework that enables evidence-aware and adaptive preference learning for Med-LVLMs. DEPO introduces Multi-Modal Evidence Perturbation (MEP) to suppress non-causal textual and visual shortcuts, and Dispreferred Evidence Resampling (DER) to continuously update dispreferred responses as hallucination patterns evolve. Experiments on multiple medical VQA and report generation benchmarks demonstrate consistent improvements over existing methods, with strong robustness across datasets and architectures. All code and data will be released after review.
Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, whereas reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot–value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.

2025

Multi-modal intent recognition (MIR) requires integrating non-verbal cues from real-world contexts to enhance human intention understanding, and it has attracted substantial research attention in recent years. Despite promising advancements, a comprehensive survey summarizing recent advances and new frontiers remains absent. To this end, we present a thorough and unified review of MIR, covering four aspects: (1) Extensive survey: we take the first step to present a thorough survey of this research field covering textual, visual (image/video), and acoustic signals. (2) Unified taxonomy: we provide a unified framework including evaluation protocol and advanced methods to summarize the current progress in MIR. (3) Emerging frontiers: we discuss future directions such as multi-task, multi-domain, and multi-lingual MIR, and share our thoughts on each. (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope this survey can shed light on future research in MIR.
Relation Triplet Extraction (RTE) is a fundamental yet challenging task in knowledge acquisition, which identifies and extracts all triplets from unstructured text. Despite recent advancements, the deep integration of entity-, relation-, and triplet-specific information remains a challenge. In this paper, we propose a Graph-based Mixture-of-Experts mutual learning framework for RTE, namely RTE-GMoE, to address this limitation. As a model-agnostic framework, RTE-GMoE distinguishes itself by including and modeling the mutual interactions among three vital task-specific experts: the entity expert, the RTE expert, and the relation expert. The RTE expert corresponds to the main RTE task and can be implemented by any model, while the other two correspond to the two auxiliary tasks: entity recognition and relation extraction. We construct an expert graph and achieve comprehensive and adaptive graph-based MoE interactions with a novel mutual learning mechanism. In our framework, these experts perform knowledge extraction collaboratively via dynamic information exchange and knowledge sharing. We conduct extensive experiments on four state-of-the-art backbones and evaluate them on several widely used benchmarks. The results demonstrate that our framework brings consistent and promising improvements on all backbones and benchmarks. A component study and model analysis further verify the effectiveness and advantages of our method.
Knowledge base question answering (KBQA) aims to answer natural language questions by reasoning over structured knowledge bases. Existing approaches often struggle with the complexity of mapping questions to precise logical forms, particularly when dealing with diverse entities and relations. In this paper, we propose Hierarchical Topology Multi-task Learning (HTML), a novel framework that leverages a hierarchical multi-task learning paradigm to enhance the performance of logical form generation. Our framework consists of a main task: generating logical forms from questions, and three auxiliary tasks: entity prediction from the input question, relation prediction for the given entities, and logical form generation based on the given entities and relations. Through joint instruction-tuning, HTML allows mutual guidance and knowledge transfer among the hierarchical tasks, capturing the subtle dependencies between entities, relations, and logical forms. Extensive experiments on public benchmarks show that HTML markedly outperforms both supervised fine-tuning methods and training-free ones based on powerful large language models (e.g., GPT-4), demonstrating its superiority in question understanding and structural knowledge reasoning.
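The main task and three auxiliary tasks described above can be pictured as four instruction-tuning examples derived from one annotated question. The Python sketch below is only an illustration under stated assumptions: the function name, the task names, the prompt templates, and the toy logical form are hypothetical and are not taken from the HTML paper.

```python
def build_multitask_examples(question, entities, relations, logical_form):
    """Turn one annotated question into four instruction-tuning examples.

    All prompt wording here is hypothetical; only the split into one main
    task and three auxiliary tasks follows the abstract.
    """
    ents = ", ".join(entities)
    rels = ", ".join(relations)
    return [
        # Main task: question -> logical form.
        {"task": "logical_form",
         "input": f"Question: {question}",
         "output": logical_form},
        # Auxiliary task 1: predict entities from the question.
        {"task": "entity_prediction",
         "input": f"Question: {question}",
         "output": ents},
        # Auxiliary task 2: predict relations for the given entities.
        {"task": "relation_prediction",
         "input": f"Question: {question}\nEntities: {ents}",
         "output": rels},
        # Auxiliary task 3: logical form given entities and relations.
        {"task": "guided_logical_form",
         "input": f"Question: {question}\nEntities: {ents}\nRelations: {rels}",
         "output": logical_form},
    ]


# Toy annotation; the logical-form syntax is made up for illustration.
examples = build_multitask_examples(
    question="Who directed Inception?",
    entities=["Inception"],
    relations=["film.directed_by"],
    logical_form="(JOIN film.directed_by Inception)",
)
for example in examples:
    print(example["task"], "->", example["output"])
```

Mixing the four example types in one training set is one simple way to realize joint instruction-tuning; the paper's actual task formatting and weighting may differ.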

2024

In the realm of multi-intent spoken language understanding (SLU), recent advancements have leveraged the potential of prompt learning frameworks. However, critical gaps exist in these frameworks: the lack of explicit modeling of dual-task dependencies and the oversight of task-specific semantic differences among utterances. To address these shortcomings, we propose DC-Instruct, a novel generative framework based on Dual-task Inter-dependent Instructions (DII) and Supervised Contrastive Instructions (SCI). Specifically, DII guides large language models (LLMs) to generate labels for one task based on the other task’s labels, thereby explicitly capturing dual-task inter-dependencies. Moreover, SCI leverages semantic differences between utterances by guiding LLMs to determine whether a pair of utterances share the same or similar labels. This improves the ability of LLMs to extract and discriminate task-specific semantics, thus enhancing their SLU reasoning abilities. Extensive experiments on public benchmark datasets show that DC-Instruct markedly outperforms current generative models and state-of-the-art methods, demonstrating its effectiveness in enhancing dialogue language understanding and reasoning.
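As a rough illustration of the SCI idea above (asking whether two utterances share the same or similar labels), the following Python sketch builds yes/no pair examples from labeled utterances. It is not the DC-Instruct implementation: the prompt wording, the use of Jaccard overlap, and the `threshold` parameter are assumptions chosen to make "same or similar" concrete.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two label sets (defined as 1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_sci_pairs(examples, threshold=0.5):
    """Create contrastive yes/no instructions for every utterance pair.

    examples: list of (utterance, intent_labels). The threshold deciding
    what counts as "similar" is a hypothetical choice.
    """
    pairs = []
    for (u1, l1), (u2, l2) in combinations(examples, 2):
        similar = jaccard(l1, l2) >= threshold
        pairs.append({
            "instruction": (
                "Do these two utterances express the same or similar "
                f"intents?\nA: {u1}\nB: {u2}"
            ),
            "answer": "yes" if similar else "no",
        })
    return pairs


# Toy utterances with made-up intent labels.
data = [
    ("play jazz and set an alarm", {"PlayMusic", "SetAlarm"}),
    ("wake me at six and put on some rock", {"SetAlarm", "PlayMusic"}),
    ("what is the weather in Paris", {"GetWeather"}),
]
sci_pairs = build_sci_pairs(data)
for pair in sci_pairs:
    print(pair["answer"])
```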

2022

The task of joint dialog sentiment classification (DSC) and act recognition (DAR) aims to simultaneously predict the sentiment label and act label for each utterance in a dialog. In this paper, we put forward a new framework that models the explicit dependencies by integrating prediction-level interactions rather than semantics-level interactions, which is more consistent with human intuition. Besides, we propose a speaker-aware temporal graph (SATG) and a dual-task relational temporal graph (DRTG) to introduce temporal relations into dialog understanding and dual-task reasoning. To implement our framework, we propose a novel model dubbed DARER, which first generates context-, speaker-, and temporal-sensitive utterance representations by modeling SATG, then conducts recurrent dual-task relational reasoning on DRTG, during which the estimated label distributions act as key clues in prediction-level interactions. Experimental results show that DARER outperforms existing models by large margins while requiring much less computational resources and training time. Remarkably, on the DSC task on Mastodon, DARER gains a relative improvement of about 25% over the previous best model in terms of F1, with fewer than 50% of the parameters and only about 60% of the required GPU memory.
Recent graph-based models for joint multiple intent detection and slot filling have obtained promising results by modeling the guidance from the prediction of intents to the decoding of slot filling. However, existing methods (1) only model the unidirectional guidance from intent to slot; (2) adopt homogeneous graphs to model the interactions between the slot semantics nodes and intent label nodes, which limits performance. In this paper, we propose a novel model termed Co-guiding Net, which implements a two-stage framework achieving mutual guidance between the two tasks. In the first stage, the initial estimated labels of both tasks are produced, and they are then leveraged in the second stage to model the mutual guidance. Specifically, we propose two heterogeneous graph attention networks working on the two proposed heterogeneous semantics-label graphs, which effectively represent the relations among the semantics nodes and label nodes. Experimental results show that our model outperforms existing models by a large margin, obtaining a relative improvement of 19.3% over the previous best model on the MixATIS dataset in overall accuracy.
Recent joint multiple intent detection and slot filling models employ label embeddings to achieve semantics-label interactions. However, they treat all labels and label embeddings as uncorrelated individuals, ignoring the dependencies among them. Besides, they conduct the decoding for the two tasks independently, without leveraging the correlations between them. Therefore, in this paper, we first construct a Heterogeneous Label Graph (HLG) containing two kinds of topologies: (1) statistical dependencies based on labels’ co-occurrence patterns and hierarchies in slot labels; (2) rich relations among the label nodes. Then we propose a novel model termed ReLa-Net, which can capture beneficial correlations among the labels from HLG. The label correlations are leveraged to enhance semantics-label interactions. Moreover, we propose a label-aware inter-dependent decoding mechanism to further exploit the label correlations for decoding. Experimental results show that ReLa-Net significantly outperforms previous models. Remarkably, ReLa-Net surpasses the previous best model by over 20% in terms of overall accuracy on the MixATIS dataset.
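The "statistical dependencies based on labels' co-occurrence patterns" mentioned above can be illustrated with a small Python sketch that turns per-utterance label sets into directed, weighted edges. This is only one plausible reading: the conditional-probability weighting, the `min_prob` cutoff, the function name, and the toy label names are assumptions, not details taken from the ReLa-Net paper.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_edges(label_sets, min_prob=0.5):
    """Directed edges (a, b) weighted by the empirical P(b present | a present).

    label_sets: one collection of intent and slot labels per utterance.
    Edges whose weight falls below min_prob are dropped (hypothetical cutoff).
    """
    label_count = Counter()
    pair_count = Counter()
    for labels in label_sets:
        labels = sorted(set(labels))
        label_count.update(labels)
        # Count every ordered pair of distinct labels seen together.
        pair_count.update(permutations(labels, 2))
    edges = {}
    for (a, b), n in pair_count.items():
        prob = n / label_count[a]
        if prob >= min_prob:
            edges[(a, b)] = prob
    return edges


# Toy annotations with made-up label names.
annotations = [
    {"intent:flight", "slot:fromloc"},
    {"intent:flight", "slot:fromloc", "slot:toloc"},
    {"intent:airfare", "slot:toloc"},
]
label_edges = cooccurrence_edges(annotations, min_prob=0.6)
for edge, weight in sorted(label_edges.items()):
    print(edge, round(weight, 2))
```

The resulting edge dictionary could serve as the adjacency structure of a label graph; how such statistics are combined with slot hierarchies and other relations is left to the paper.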