Jun Liu
2026
MUR: Momentum Uncertainty guided Reasoning for Large Language Models
Hang Yan | Fangzhi Xu | Rongman Xu | Yifei Li | Jian Zhang | Haoran Luo | Xiaobao Wu | Anh Tuan Luu | Haiteng Zhao | Qika Lin | Jun Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. While Test-Time Scaling (TTS) improves reasoning quality, it often leads to overthinking—wasting tokens on redundant computations. This work investigates how to efficiently and adaptively guide LLM TTS without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budgets to critical reasoning steps by tracking and aggregating step-wise uncertainty over time. To support flexible inference-time control, we introduce -control, a simple mechanism that tunes the reasoning budget via a single hyperparameter. We provide in-depth theoretical proof to support the superiority of MUR in terms of stability and biases. MUR is comprehensively evaluated against various TTS methods across four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using different sizes of recent Qwen3 models (1.7B, 4B, and 8B). Results demonstrate that MUR reduces computation by over 45% on average while improving accuracy by 0.33–3.46%.
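The idea of tracking step-wise uncertainty with momentum can be illustrated with a small sketch. This is not the paper's formulation: the exponential moving average, the additive `margin` rule, and the names `alpha`, `margin`, `step_uncertainty`, and `flag_steps` are all assumptions chosen for illustration. A step is flagged for extra reasoning budget when its uncertainty rises well above the running average of earlier steps.

```python
from typing import List, Optional


def step_uncertainty(token_logprobs: List[float]) -> float:
    """Mean negative log-probability of the tokens in one reasoning step."""
    return -sum(token_logprobs) / len(token_logprobs)


def flag_steps(uncertainties: List[float],
               alpha: float = 0.9,
               margin: float = 0.5) -> List[bool]:
    """Flag steps whose uncertainty exceeds the running average by `margin`.

    `alpha` and `margin` are hypothetical knobs, not values from the paper.
    """
    flags: List[bool] = []
    momentum: Optional[float] = None
    for u in uncertainties:
        if momentum is None:
            # The first step has no history to compare against.
            flags.append(False)
            momentum = u
        else:
            flags.append(u > momentum + margin)
            # Exponential moving average of past step uncertainties.
            momentum = alpha * momentum + (1 - alpha) * u
    return flags


if __name__ == "__main__":
    print(flag_steps([1.0, 1.0, 2.0, 1.0], alpha=0.5, margin=0.5))
```

In this toy run only the third step is flagged, because it is the only one whose uncertainty jumps above the running average by more than the margin.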
PhysPRM: A Generative Process Reward Model with Fine-grained Diagnosis for Physics Problem Solving
Yuxuan Dong | Xinyu Zhang | Lingling Zhang | Han Lai | Pengyu Li | Bifan Wei | Yaqiang Wu | Jun Liu
Findings of the Association for Computational Linguistics: ACL 2026
Despite the remarkable progress of Large Language Models (LLMs) in abstract reasoning tasks, they continue to struggle with physics problem solving due to difficulties in decoding implicit constraints and maintaining physical consistency. To address these challenges, Process Reward Models (PRMs) have emerged as a promising approach to verify intermediate reasoning steps. Existing PRMs attempt to mitigate reasoning errors but typically rely on scalar scoring, which lacks the explanatory power necessary to diagnose complex physical misconceptions. In this work, we introduce PhysPRM, a Generative PRM that treats evaluation as a generative task to produce fine-grained diagnoses comprising critiques, final judgments, and specific error types. To facilitate this, we develop an automated data synthesis pipeline to construct PhysPRM30K, a comprehensive training dataset, and PhysProcessBench, a rigorously human-verified benchmark. By employing a two-stage training paradigm that integrates Supervised Fine-Tuning with Group Relative Policy Optimization, PhysPRM significantly enhances the physics reasoning capabilities of various LLMs. Extensive experiments demonstrate that PhysPRM improves performance across seven benchmarks in both Best-of-N and critique refinement strategies.
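Best-of-N selection with step-level scores, as mentioned in the abstract above, can be sketched generically. The scorer below is a hypothetical stand-in for a process reward model, and aggregating step scores by their minimum is one common convention assumed here, not a detail confirmed by the abstract.

```python
from typing import Callable, List


def solution_score(steps: List[str], step_scorer: Callable[[str], float]) -> float:
    """Score a solution by its weakest step (assumed aggregation rule)."""
    return min(step_scorer(step) for step in steps)


def best_of_n(candidates: List[List[str]],
              step_scorer: Callable[[str], float]) -> int:
    """Return the index of the candidate solution with the highest score."""
    if not candidates:
        raise ValueError("need at least one candidate solution")
    scores = [solution_score(steps, step_scorer) for steps in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])


if __name__ == "__main__":
    # Toy scorer: a lookup table standing in for a learned reward model.
    toy_scores = {"F = ma": 0.9, "a = F / m": 0.8, "a = m / F": 0.1}
    candidates = [["F = ma", "a = m / F"], ["F = ma", "a = F / m"]]
    print(best_of_n(candidates, toy_scores.__getitem__))
```

The second candidate wins in this toy case because its weakest step still scores higher than the erroneous step in the first.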
Locomo-Plus: Beyond-Factual Cognitive Memory Evaluation Framework for LLM Agents
Yifei Li | Weidong Guo | Lingling Zhang | Rongman Xu | Muye Huang | Hui Liu | Lijiao Xu | Yu Xu | Jun Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Long-term conversational memory is a core capability for LLM-based dialogue systems, yet existing benchmarks and evaluation protocols primarily focus on surface-level factual recall. In realistic interactions, appropriate responses often depend on implicit constraints such as user state, goals, or values that are not explicitly queried later. To evaluate this setting, we introduce LoCoMo-Plus, a benchmark for assessing cognitive memory under cue–trigger semantic disconnect, where models must retain and apply latent constraints across long conversational contexts. We further show that conventional string-matching metrics and explicit task-type prompting are misaligned with such scenarios, and propose a unified evaluation framework based on constraint consistency. Experiments across diverse backbone models, retrieval-based methods, and memory systems demonstrate that cognitive memory remains challenging and reveal failures not captured by existing benchmarks. Our code and evaluation framework are publicly available at https://github.com/xjtuleeyf/Locomo-Plus.
VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG
Honghao Fu | Miao Xu | Yiwei Wang | Dailing Zhang | Jun Liu | Yujun Cai
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. While retrieval-augmented generation (RAG) is a promising remedy by organizing query-relevant visual evidence into a compact context, most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query’s intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query’s reasoning intent. To support this capability, we curate IR-600K, a large-scale dataset tailored for learning frame–query intent alignment. Experiments show that VideoStir is competitive with state-of-the-art baselines without relying on auxiliary information, highlighting the promise of shifting long-video RAG from flattened semantic matching to structured, intent-aware reasoning. Code and checkpoints are available at https://github.com/RomGai/VideoStir.
GeoLaux: A Benchmark for Evaluating MLLMs’ Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
Yumeng Fu | Jiayin Zhu | Lingling Zhang | Wenjun Wu | Bo Zhao | Shaoxuan Ma | Yushun Zhang | Jun Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models’ understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark MLLMs’ geometry reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux.
AGTAO: Robust and Stabilized LLM Unlearning via Adversarial Gating Training with Adaptive Orthogonality
Pengyu Li | Lingling Zhang | Zhitao Gao | Yanrui Wu | Yuxuan Dong | Huan Liu | Bifan Wei | Jun Liu
Findings of the Association for Computational Linguistics: ACL 2026
While Large Language Models (LLMs) have achieved remarkable capabilities, they unintentionally memorize sensitive data, posing critical privacy and security risks. Machine unlearning is pivotal for mitigating these risks, yet existing paradigms face a fundamental dilemma: aggressive unlearning often induces catastrophic forgetting that degrades model utility, whereas conservative strategies risk superficial forgetting, leaving models vulnerable to adversarial recovery. To address this trade-off, we propose AGTAO (Adversarial Gating Training with Adaptive Orthogonality), a unified framework designed to reconcile robust erasure with utility preservation. Specifically, our approach introduces Adaptive Orthogonality (AO) to dynamically mitigate geometric gradient conflicts between forgetting and retention objectives, thereby minimizing unintended knowledge degradation. Concurrently, Adversarial Gating Training (AGT) formulates unlearning as a latent-space min-max game, employing a curriculum-based gating mechanism to simulate and counter internal recovery attempts. Extensive experiments demonstrate that AGTAO achieves a superior trade-off between unlearning efficacy (KUR ≈ 0.01) and model utility (MMLU 58.30). Code is available at https://anonymous.4open.science/r/AGT-unlearning.
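One generic way to mitigate a conflict between a forgetting gradient and a retention gradient is to project the conflicting component out, in the style of gradient-surgery methods. The sketch below shows only that well-known projection; it is not AGTAO's Adaptive Orthogonality, and the function name and the plain-list gradient representation are assumptions made for illustration.

```python
from typing import List


def project_out_conflict(g_forget: List[float], g_retain: List[float]) -> List[float]:
    """Remove from g_forget the component that opposes g_retain.

    If the two gradients do not conflict (non-negative dot product),
    g_forget is returned unchanged.
    """
    dot = sum(f * r for f, r in zip(g_forget, g_retain))
    norm_sq = sum(r * r for r in g_retain)
    if dot >= 0 or norm_sq == 0:
        return list(g_forget)
    scale = dot / norm_sq
    # Subtract the projection of g_forget onto g_retain.
    return [f - scale * r for f, r in zip(g_forget, g_retain)]


if __name__ == "__main__":
    adjusted = project_out_conflict([1.0, -1.0], [0.0, 1.0])
    print(adjusted)
```

After the projection, the adjusted forgetting gradient is orthogonal to the retention gradient, so a step along it no longer directly opposes the retention objective to first order.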
Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
Xinyu Zhang | Yuchen Wan | Boxuan Zhang | Zesheng Yang | Lingling Zhang | Bifan Wei | Jun Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) often struggle with structural ambiguity in optimization problems, where a single problem admits multiple related but conflicting modeling paradigms, hindering effective solution generation. To address this, we propose Dual-Cluster Memory Agent (DCM-Agent) to enhance performance by leveraging historical solutions in a training-free manner. Central to this is Dual-Cluster Memory Construction. This agent assigns historical solutions to modeling and coding clusters, then distills each cluster’s content into three structured types: Approach, Checklist, and Pitfall. This process derives generalizable guidance knowledge. Furthermore, this agent introduces Memory-augmented Inference to dynamically navigate solution paths, detect and repair errors, and adaptively switch reasoning paths with structured knowledge. Experiments across seven optimization benchmarks demonstrate that DCM-Agent achieves an average performance improvement of 11%–21%. Notably, our analysis reveals a “knowledge inheritance” phenomenon: memory constructed by larger models can guide smaller models toward superior performance, highlighting the framework’s scalability and efficiency.
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
Xinyu Zhang | Boxuan Zhang | Yuchen Wan | Lingling Zhang | YiXing Yao | Bifan Wei | Yaqiang Wu | Jun Liu
Findings of the Association for Computational Linguistics: ACL 2026
While Large Language Models (LLMs) demonstrate remarkable reasoning, complex optimization tasks remain challenging, requiring domain knowledge and robust implementation. However, existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation. To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 curated problems spanning neglected domains, including Stochastic Optimization, Dynamic Optimization, Game Optimization, and Optimal Control, across three difficulty levels: Easy, Medium, and Hard. The experiments with 22 LLMs of different sizes reveal sharp performance degradation on hard problems, where even advanced models like GPT-5.2 and Gemini-3 struggle to exceed 27% accuracy. Through error analysis, we identify that modeling logic errors remain the primary bottleneck. Consequently, we propose a Dual-View Auditor Agent that improves the accuracy of the LLM modeling process without introducing significant time overhead. OptiVerse will serve as a foundational platform for advancing LLMs in solving complex optimization challenges.
MAXS: Meta-Adaptive Exploration with LLM Agents
Jian Zhang | Zhiyuan Wang | Zhangqi Wang | Yu He | Haoran Luo | Li Yuan | Lingling Zhang | Rui Mao | Qika Lin | Jun Liu
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Model (LLM) Agents exhibit inherent reasoning abilities through the collaboration of multiple tools. However, during agent inference, existing methods often suffer from (i) locally myopic generation, due to the absence of lookahead, and (ii) trajectory instability, where minor early errors can escalate into divergent reasoning paths. These issues make it difficult to balance global effectiveness and computational efficiency. To address these two issues, we propose meta-adaptive exploration with LLM agents (MAXS; https://github.com/exoskeletonzj/MAXS), a meta-adaptive reasoning framework based on LLM Agents that flexibly integrates tool execution and reasoning planning. MAXS employs a lookahead strategy to extend reasoning paths a few steps ahead, estimating the advantage value of tool usage, and combines step consistency variance and inter-step trend slopes to jointly select stable, consistent, and high-value reasoning steps. Additionally, we introduce a trajectory convergence mechanism that controls computational cost by halting further rollouts once path consistency is achieved, enabling a balance between resource efficiency and global effectiveness in multi-tool reasoning. We conduct extensive empirical studies across three base models (MiMo-VL-7B, Qwen2.5-VL-7B, Qwen2.5-VL-32B) and five datasets, demonstrating that MAXS consistently outperforms existing methods in both performance and inference efficiency. Further analysis confirms the effectiveness of our lookahead strategy and tool usage.
2025
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
Xinyu Zhang | Yuxuan Dong | Yanrui Wu | Jiaxing Huang | Chengyou Jia | Basura Fernando | Mike Zheng Shou | Lingling Zhang | Jun Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models demonstrate remarkable capabilities across various domains, especially mathematics and logic reasoning. However, current evaluations overlook physics-based reasoning, a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard problems requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, incorporating efficient answer-level and comprehensive step-level evaluations. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identified four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models.
Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
Fangzhi Xu | Qiushi Sun | Kanzhi Cheng | Jun Liu | Yu Qiao | Zhiyong Wu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
One of the primary driving forces contributing to the superior performance of Large Language Models (LLMs) is the extensive availability of human-annotated natural language data, which is used for alignment fine-tuning. This inspired researchers to investigate self-training methods to mitigate the extensive reliance on human annotations. However, the current success of self-training has been primarily observed in natural language scenarios, rather than in the increasingly important neural-symbolic scenarios. To this end, we propose an environment-guided neural-symbolic self-training framework named ENVISIONS. It aims to overcome two main challenges: (1) the scarcity of symbolic data, and (2) the limited proficiency of LLMs in processing symbolic language. Extensive evaluations conducted on three distinct domains demonstrate the effectiveness of our approach. Additionally, we have conducted a comprehensive analysis to uncover the factors contributing to ENVISIONS’s success, thereby offering valuable insights for future research in this area.
ResoFilter: Fine-grained Synthetic Data Filtering for Large Language Models through Data-Parameter Resonance Analysis
Zeao Tu | Xiangdi Meng | Yu He | Zihan Yao | Tianyu Qi | Jun Liu | Ming Li
Findings of the Association for Computational Linguistics: NAACL 2025
Large language models (LLMs) have shown remarkable effectiveness across various domains, with data augmentation methods utilizing GPT for synthetic data generation becoming prevalent. However, the quality and utility of augmented data remain questionable, and current methods lack clear metrics for evaluating data characteristics. To address these challenges, we propose ResoFilter, a novel method that integrates models, data, and tasks to refine datasets. ResoFilter leverages the fine-tuning process to obtain Data-Parameter features for data selection, offering improved interpretability by representing data characteristics through model weights. Our experiments demonstrate that ResoFilter achieves comparable results to full-scale fine-tuning using only half the data in mathematical tasks and exhibits strong generalization across different models and domains. This method provides valuable insights for constructing synthetic datasets and evaluating high-quality data, offering a promising solution for enhancing data augmentation techniques and improving training dataset quality for LLMs. For reproducibility, we will release our code and data upon acceptance.
𝜙-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
Fangzhi Xu | Hang Yan | Chang Ma | Haiteng Zhao | Jun Liu | Qika Lin | Zhiyong Wu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance to derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Built on it, we propose a novel decoding strategy, named 𝜙-Decoding. To provide a precise and expressive estimation of step value, 𝜙-Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a light-weight solution to achieve inference efficiency. Extensive experiments across seven benchmarks show 𝜙-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets.
GUICourse: From General Vision Language Model to Versatile GUI Agent
Wentong Chen | Junbo Cui | Jinyi Hu | Yujia Qin | Junjie Fang | Yue Zhao | Chongyi Wang | Jun Liu | Guirong Chen | Yupeng Huo | Yuan Yao | Yankai Lin | Zhiyuan Liu | Maosong Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Utilizing Graphical User Interfaces (GUIs) for human-computer interaction is essential for accessing various digital tools. Recent advancements in Vision Language Models (VLMs) reveal significant potential for developing versatile agents that assist humans in navigating GUIs. However, current VLMs face challenges related to fundamental abilities, such as OCR and grounding, as well as a lack of knowledge about GUI elements’ functionalities and control methods. These limitations hinder their effectiveness as practical GUI agents. To address these challenges, we introduce GUICourse, a series of datasets for training visual-based GUI agents using general VLMs. First, we enhance the OCR and grounding capabilities of VLMs using the GUIEnv dataset. Next, we enrich the GUI knowledge of VLMs using the GUIAct and GUIChat datasets. Our experiments demonstrate that even a small-sized GUI agent (with 3.1 billion parameters) performs effectively on both single-step and multi-step GUI tasks. We further finetune our GUI agents on other GUI tasks with different action spaces (AITW and Mind2Web), and the results show that our agents are better than their baseline VLMs. Additionally, we analyze the impact of OCR and grounding capabilities through an ablation study, revealing a positive correlation with GUI navigation ability.
Diagram-Driven Course Questions Generation
Xinyu Zhang | Lingling Zhang | Yanrui Wu | Muye Huang | Wenjun Wu | Bo Li | Shaowei Wang | Basura Fernando | Jun Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Visual Question Generation (VQG) research focuses predominantly on natural images while neglecting the diagram, which is a critical component in educational materials. To meet the needs of pedagogical assessment, we propose the Diagram-Driven Course Questions Generation (DDCQG) task and construct DiagramQG, a comprehensive dataset with 15,720 diagrams and 25,798 questions across 37 subjects and 371 courses. Our approach employs course and input text constraints to generate course-relevant questions about specific diagram elements. We reveal three challenges of DDCQG: domain-specific knowledge requirements across courses, long-tail distribution in course coverage, and high information density in diagrams. To address these, we propose the Hierarchical Knowledge Integration framework (HKI-DDCQG), which utilizes trainable CLIP for identifying relevant diagram patches, leverages frozen vision-language models for knowledge extraction, and generates questions with trainable T5. Experiments demonstrate that HKI-DDCQG outperforms existing models on DiagramQG while maintaining strong generalizability across natural image datasets, establishing a strong baseline for DDCQG.
Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization
Shuyang Hao | Yiwei Wang | Bryan Hooi | Jun Liu | Muhao Chen | Zi Huang | Yujun Cai
Findings of the Association for Computational Linguistics: EMNLP 2025
In the realm of large vision-language models (LVLMs), adversarial jailbreak attacks serve as a red-teaming approach to identify safety vulnerabilities of these models and their associated defense mechanisms. However, we identify a critical limitation: not every adversarial optimization step leads to a positive outcome, and indiscriminately accepting optimization results at each step may reduce the overall attack success rate. To address this challenge, we introduce HKVE (Hierarchical Key-Value Equalization), an innovative jailbreaking framework that selectively accepts gradient optimization results based on the distribution of attention scores across different layers, ensuring that every optimization step positively contributes to the attack. Extensive experiments demonstrate HKVE’s significant effectiveness, achieving attack success rates of 75.08% on MiniGPT4, 85.84% on LLaVA and 81.00% on Qwen-VL, substantially outperforming existing methods by margins of 20.43%, 21.01% and 26.43% respectively. Furthermore, making every step effective not only leads to an increase in attack success rate but also allows for a reduction in the number of iterations, thereby lowering computational costs.
Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
Fangzhi Xu | Hang Yan | Chang Ma | Haiteng Zhao | Qiushi Sun | Kanzhi Cheng | Junxian He | Jun Liu | Zhiyong Wu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problems of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. Given the input query, the LLM seeks the globally optimal response by stepwise sampling and self-rewarding, and optimizes itself with the collected responses. Genius offers technical solutions to the following key challenges. To tackle the problem of how to determine the steps in the response via self-rewarding, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Recognizing the intrinsic noise and uncertainty of self-supervision, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. In short, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries.
2024
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
Weiping Fu | Bifan Wei | Jianxiang Hu | Zhongmin Cai | Jun Liu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Automatically generated questions often suffer from problems such as unclear expression or factual inaccuracies, requiring a reliable and comprehensive evaluation of their quality. Human evaluation is widely used in the field of question generation (QG) and serves as the gold standard for automatic metrics. However, there is a lack of unified human evaluation criteria, which hampers consistent and reliable evaluations of both QG models and automatic metrics. To address this, we propose QGEval, a multi-dimensional Evaluation benchmark for Question Generation, which evaluates both generated questions and existing automatic metrics across 7 dimensions: fluency, clarity, conciseness, relevance, consistency, answerability, and answer consistency. We demonstrate the appropriateness of these dimensions by examining their correlations and distinctions. Through consistent evaluations of QG models and automatic metrics with QGEval, we find that 1) most QG models perform unsatisfactorily in terms of answerability and answer consistency, and 2) existing metrics fail to align well with human judgments when evaluating generated questions across the 7 dimensions. We expect this work to foster the development of both QG technologies and their evaluation.
A Semantic Mention Graph Augmented Model for Document-Level Event Argument Extraction
Jian Zhang | Changlin Yang | Haiping Zhu | Qika Lin | Fangzhi Xu | Jun Liu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Document-level Event Argument Extraction (DEAE) aims to identify arguments and their specific roles from an unstructured document. The advanced approaches on DEAE utilize prompt-based methods to guide pre-trained language models (PLMs) in extracting arguments from input documents. They mainly concentrate on establishing relations between triggers and entity mentions within documents, leaving two unresolved problems: a) independent modeling of entity mentions; b) document-prompt isolation. To this end, we propose a semantic mention Graph Augmented Model (GAM) to address these two problems in this paper. Firstly, GAM constructs a semantic mention graph that captures relations within and between documents and prompts, encompassing co-existence, co-reference and co-type relations. Furthermore, we introduce an ensemble graph transformer module to address mentions and their three semantic relations effectively. Later, the graph-augmented encoder-decoder module incorporates the relation-specific graph into the input embedding of PLMs and optimizes the encoder section with topology information, enhancing the relations comprehensively. Extensive experiments on the RAMS and WikiEvents datasets demonstrate the effectiveness of our approach, surpassing baseline methods and achieving a new state-of-the-art performance.
When Phrases Meet Probabilities: Enabling Open Relation Extraction with Cooperating Large Language Models
Jiaxin Wang | Lingling Zhang | Wee Sun Lee | Yujie Zhong | Liwei Kang | Jun Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Current clustering-based open relation extraction (OpenRE) methods usually apply clustering algorithms on top of pre-trained language models. However, this practice has three drawbacks. First, embeddings from language models are high-dimensional and anisotropic, so using simple metrics to calculate distances between these embeddings may not accurately reflect the relational similarity. Second, there exists a gap between the pre-trained language models and downstream clustering owing to their different objective forms. Third, clustering with embeddings deviates from the primary aim of relation extraction, as it does not directly obtain relations. In this work, we propose a new idea for OpenRE in the era of LLMs, that is, extracting relational phrases and directly exploiting the knowledge in LLMs to assess the semantic similarity between phrases without relying on any additional metrics. Based on this idea, we developed a framework, oreLLM, that makes two LLMs work collaboratively to achieve clustering and address the above issues. Experimental results on different datasets show that oreLLM outperforms current baselines by 1.4% to 3.13% in terms of clustering accuracy.
Symbol-LLM: Towards Foundational Symbol-centric Interface For Large Language Models
Fangzhi Xu | Zhiyong Wu | Qiushi Sun | Siyu Ren | Fei Yuan | Shuai Yuan | Qika Lin | Yu Qiao | Jun Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Although Large Language Models (LLMs) demonstrate remarkable ability in processing and generating human-like text, they do have limitations when it comes to comprehending and expressing world knowledge that extends beyond the boundaries of natural language (e.g., chemical molecular formulas). Injecting a collection of symbolic data directly into the training of LLMs can be problematic, as it disregards the synergies among different symbolic families and overlooks the need for a balanced mixture of natural and symbolic data. In this work, we tackle these challenges from both a data and framework perspective and introduce Symbol-LLM series models. First, we curated a data collection consisting of 34 tasks and incorporating 20 distinct symbolic families, intending to capture the interrelations and foster synergies between symbols. Then, a two-stage tuning framework succeeds in injecting symbolic knowledge without loss of general ability. Extensive experiments on both symbol- and NL-centric tasks demonstrate the balanced and superior performances of Symbol-LLM series models.
PathReasoner: Modeling Reasoning Path with Equivalent Extension for Logical Question Answering
Fangzhi Xu | Qika Lin | Tianzhe Zhao | Jiawei Han | Jun Liu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The logical reasoning task has attracted great interest since it was proposed. Faced with such a task, current competitive models, even large language models (e.g., ChatGPT and PaLM 2), still perform badly. Previous promising LMs struggle in logical consistency modeling and logical structure perception. To this end, we model the logical reasoning task by transforming each logical sample into reasoning paths and propose an architecture, PathReasoner. It addresses the task from the views of both data and model. To expand the diversity of the logical samples, we propose an atom extension strategy supported by equivalent logical formulas to form new reasoning paths. From the model perspective, we design a stack of transformer-style blocks. In particular, we propose a path-attention module to jointly model in-atom and cross-atom relations with the high-order diffusion strategy. Experiments show that PathReasoner achieves competitive performance on two logical reasoning benchmarks and great generalization abilities.
2023
Enhancing Multilingual Document-Grounded Dialogue Using Cascaded Prompt-Based Post-Training Models
Jun Liu | Shuang Cheng | Zineng Zhou | Yang Gu | Jian Ye | Haiyong Luo
Proceedings of the Third DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering
The Dialdoc23 shared task presents a Multilingual Document-Grounded Dialogue Systems (MDGDS) challenge, where system responses are generated in multiple languages using users’ queries, historical dialogue records and relevant passages. A major challenge for this task is the limited training data available in low-resource languages such as French and Vietnamese. In this paper, we propose Cascaded Prompt-based Post-training Models, dividing the task into three subtasks: Retrieval, Reranking and Generation. We conduct post-training on high-resource languages such as English and Chinese to enhance performance on low-resource languages by exploiting the similarities between languages. Additionally, we utilize the prompt method to activate the model’s ability on diverse languages within the dialogue domain and explore what makes a good prompt. Our comprehensive experiments demonstrate the effectiveness of our proposed methods, which achieved first place on the leaderboard with a total score of 215.40 in token-level F1, SacreBleu, and Rouge-L metrics.
Synthesize, Prompt and Transfer: Zero-shot Conversational Question Generation with Pre-trained Language Model
Hongwei Zeng | Bifan Wei | Jun Liu | Weiping Fu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Conversational question generation aims to generate questions that depend on both context and conversation history. Conventional works utilizing deep learning have shown promising results, but heavily rely on the availability of large-scale annotated conversations. In this paper, we introduce a more realistic and less explored setting, Zero-shot Conversational Question Generation (ZeroCQG), which requires no human-labeled conversations for training. To solve ZeroCQG, we propose a multi-stage knowledge transfer framework, Synthesize, Prompt, and trAnsfer with pRe-Trained lAnguage model (SPARTA) to effectively leverage knowledge from single-turn question generation instances. To validate the zero-shot performance of SPARTA, we conduct extensive experiments on three conversational datasets: CoQA, QuAC, and DoQA by transferring knowledge from three single-turn datasets: MS MARCO, NewsQA, and SQuAD. The experimental results demonstrate the superior performance of our method. Specifically, SPARTA has achieved 14.81 BLEU-4 (88.2% absolute improvement compared to T5) in CoQA with knowledge transferred from SQuAD.
TECHS: Temporal Logical Graph Networks for Explainable Extrapolation Reasoning
Qika Lin | Jun Liu | Rui Mao | Fangzhi Xu | Erik Cambria
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Extrapolation reasoning on temporal knowledge graphs (TKGs) aims to forecast future facts based on past counterparts. There are two main challenges: (1) incorporating the complex information, including structural dependencies, temporal dynamics, and hidden logical rules; (2) implementing differentiable logical rule learning and reasoning for explainability. To this end, we propose an explainable extrapolation reasoning framework, TEmporal logiCal grapH networkS (TECHS), which mainly contains a temporal graph encoder and a logical decoder. The former employs a graph convolutional network with temporal encoding and heterogeneous attention to embed topological structures and temporal dynamics. The latter integrates propositional reasoning and first-order reasoning by introducing a reasoning graph that iteratively expands to find the answer. A forward message-passing mechanism is also proposed to update node representations and their propositional and first-order attention scores. Experimental results demonstrate that it outperforms state-of-the-art baselines.
2022
Inductive Relation Prediction with Logical Reasoning Using Contrastive Representations
Yudai Pan | Jun Liu | Lingling Zhang | Tianzhe Zhao | Qika Lin | Xin Hu | Qianying Wang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Relation prediction in knowledge graphs (KGs) aims at predicting missing relations in incomplete triples, whereas the dominant embedding paradigm has a restriction on handling unseen entities during testing. In the real-world scenario, the inductive setting is more common because entities in the training process are finite. Previous methods capture an inductive ability by implicit logic in KGs. However, it would be challenging to precisely acquire entity-independent relational semantics of compositional logic rules and to deal with the deficient supervision of logic caused by the scarcity of relational semantics. To this end, we propose a novel graph convolutional network (GCN)-based model, LogCo, with logical reasoning by contrastive representations. LogCo first extracts enclosing subgraphs and relational paths between two entities to supply the entity-independence. Then a contrastive strategy for relational path instances and the subgraph is proposed for the issue of deficient supervision. The contrastive representations are learned for a joint training regime. Finally, prediction results and logic rules for reasoning are attained. Comprehensive experiments on twelve inductive datasets show that LogCo achieves outstanding performance compared with state-of-the-art inductive relation prediction baselines.
MatchPrompt: Prompt-based Open Relation Extraction with Semantic Consistency Guided Clustering
Jiaxin Wang | Lingling Zhang | Jun Liu | Xi Liang | Yujie Zhong | Yaqiang Wu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Relation clustering is a general approach for open relation extraction (OpenRE). Current methods have two major problems. One is that their good performance relies on large amounts of labeled and pre-defined relational instances for pre-training, which are costly to acquire in reality. The other is that they only focus on learning a high-dimensional metric space to measure the similarity of novel relations and ignore the specific relational representations of clusters. In this work, we propose a new prompt-based framework named MatchPrompt, which can realize OpenRE with efficient knowledge transfer from only a few pre-defined relational instances as well as mine the specific meanings for cluster interpretability. To the best of our knowledge, we are the first to introduce a prompt-based framework for unlabeled clustering. Experimental results on different datasets show that MatchPrompt achieves new SOTA results for OpenRE.
2021
Analyzing the Forgetting Problem in Pretrain-Finetuning of Open-domain Dialogue Response Models
Tianxing He | Jun Liu | Kyunghyun Cho | Myle Ott | Bing Liu | James Glass | Fuchun Peng
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
In this work, we study how the finetuning stage in the pretrain-finetune framework changes the behavior of a pretrained neural language generator. We focus on the transformer encoder-decoder model for the open-domain dialogue response generation task. Our major finding is that after standard finetuning, the model forgets some of the important language generation skills acquired during large-scale pretraining. We demonstrate the forgetting phenomenon through a set of detailed behavior analyses from the perspectives of knowledge transfer, context sensitivity, and function space projection. As a preliminary attempt to alleviate the forgetting problem, we propose an intuitive finetuning strategy named “mix-review”. We find that mix-review effectively regularizes the finetuning process, and the forgetting problem is alleviated to some extent. Finally, we discuss interesting behavior of the resulting dialogue model and its implications.
2018
Automatic Error Correction on Japanese Functional Expressions Using Character-based Neural Machine Translation
Jun Liu | Fei Cheng | Yiran Wang | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation
Sentence Suggestion of Japanese Functional Expressions for Chinese-speaking Learners
Jun Liu | Hiroyuki Shindo | Yuji Matsumoto
Proceedings of ACL 2018, System Demonstrations
We present a computer-assisted learning system, Jastudy, which is particularly designed for Chinese-speaking learners of Japanese as a second language (JSL) to learn Japanese functional expressions with suggestion of appropriate example sentences. The system automatically recognizes Japanese functional expressions using a free Japanese morphological analyzer MeCab, which is retrained on a new Conditional Random Fields (CRF) model. In order to select appropriate example sentences, we apply a pairwise-based machine learning tool, Support Vector Machine for Ranking (SVMrank) to estimate the complexity of the example sentences using Japanese–Chinese homographs as an important feature. In addition, we cluster the example sentences that contain Japanese functional expressions with two or more meanings and usages, based on part-of-speech, conjugation forms of verbs and semantic attributes, using the K-means clustering algorithm in Scikit-Learn. Experimental results demonstrate the effectiveness of our approach.
2017
Sentence Complexity Estimation for Chinese-speaking Learners of Japanese
Jun Liu | Yuji Matsumoto
Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation
2016
Simplification of Example Sentences for Learners of Japanese Functional Expressions
Jun Liu | Yuji Matsumoto
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)
Learning functional expressions is one of the difficulties for language learners, since functional expressions tend to have multiple meanings and complicated usages in various situations. In this paper, we report an experiment of simplifying example sentences of Japanese functional expressions especially for Chinese-speaking learners. For this purpose, we developed “Japanese Functional Expressions List” and “Simple Japanese Replacement List”. To evaluate the method, we conduct a small-scale experiment with Chinese-speaking learners on the effectiveness of the simplified example sentences. The experimental results indicate that simplified sentences are helpful in learning Japanese functional expressions.
2010
Co-authors
- Lingling Zhang 12
- Qika Lin 8
- Fangzhi Xu 8
- Bifan Wei 6
- Xinyu Zhang 5
- Yuji Matsumoto 4
- Zhiyong Wu 4
- Yuxuan Dong 3
- Qiushi Sun 3
- Yanrui Wu 3
- Yaqiang Wu 3
- Hang Yan 3
- Haiteng Zhao 3
- Yujun Cai 2
- Kanzhi Cheng 2
- Basura Fernando 2
- Weiping Fu 2
- Yu He 2
- Muye Huang 2
- Yifei Li 2
- Pengyu Li 2
- Haoran Luo 2
- Chang Ma 2
- Rui Mao 2
- Yu Qiao 2
- Hiroyuki Shindo 2
- Yuchen Wan 2
- Jiaxin Wang 2
- Wenjun Wu 2
- Rongman Xu 2
- Jian Zhang 2
- Boxuan Zhang 2
- Tianzhe Zhao 2
- Yujie Zhong 2
- Zhongmin Cai 1
- Erik Cambria 1
- Wentong Chen 1
- Guirong Chen 1
- Muhao Chen 1
- Fei Cheng 1
- Shuang Cheng 1
- Kyunghyun Cho 1
- Junbo Cui 1
- Junjie Fang 1
- Honghao Fu 1
- Yumeng Fu 1
- Zhitao Gao 1
- James Glass 1
- Yang Gu 1
- Weidong Guo 1
- Shuyang Hao 1
- Wei He 1
- Tianxing He 1
- Junxian He 1
- Bryan Hooi 1
- Min Hou 1
- Jianxiang Hu 1
- Jinyi Hu 1
- Xin Hu 1
- Jiaxing Huang 1
- Zi Huang 1
- Yupeng Huo 1
- Chengyou Jia 1
- Jiawei Han 1
- Liwei Kang 1
- Han Lai 1
- Wee Sun Lee 1
- Ming Li 1
- Bo Li 1
- Xi Liang 1
- Yankai Lin (林衍凯) 1
- Hui Liu 1
- Zhiyuan Liu 1
- Huan Liu 1
- Bing Liu 1
- Haiyong Luo 1
- Shaoxuan Ma 1
- Xiangdi Meng 1
- Myle Ott 1
- Yudai Pan 1
- Fuchun Peng 1
- Tianyu Qi 1
- Yujia Qin 1
- Siyu Ren 1
- Mike Zheng Shou 1
- Maosong Sun (孙茂松) 1
- Yonglin Teng 1
- Zeao Tu 1
- Luu Anh Tuan 1
- Yiwei Wang 1
- Yan Wang 1
- Yiran Wang 1
- Chongyi Wang 1
- Qianying Wang 1
- Shaowei Wang 1
- Zhiyuan Wang 1
- Zhangqi Wang 1
- Yiwei Wang 1
- Xiaobao Wu 1
- Jiyuan Wu 1
- Lijiao Xu 1
- Yu Xu 1
- Miao Xu 1
- Changlin Yang 1
- Zesheng Yang 1
- Zihan Yao 1
- Yuan Yao 1
- YiXing Yao 1
- Jian Ye 1
- Fei Yuan 1
- Shuai Yuan 1
- Li Yuan 1
- Hongwei Zeng 1
- Dailing Zhang 1
- Jian Zhang 1
- Yushun Zhang 1
- Yue Zhao 1
- Bo Zhao 1
- Zineng Zhou 1
- Haiping Zhu 1
- Jiayin Zhu 1
- Yu Zou (邹煜) 1