Haotian Wang

Papers on this page may belong to the following people: Haotian Wang, Haotian Wang

2026

Conflict-Aware Memory for Embodied Agents: Enhancing Vector Data Quality via Detection Rules
Kexin Ma | Haotian Wang | Shenglin Chen | Yishuai Cai | Huangyuyu | Ruochun Jin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Embodied agents have successfully leveraged large language models (LLMs) to better transform human instructions and images into executable task plans. Furthermore, memories of agents can be leveraged to achieve continual self-learning and optimization. However, vector data quality problems emerge in memories when they are projected into vector space, especially in discerning contextually similar but semantically conflicting sentences and highly similar images. This is particularly detrimental to embodied AI as it potentially distorts the robot’s actions. To address this challenge, we propose Conflict Detection Rules (CDRs) to identify and manage data quality issues in vector knowledge bases, which assist in correcting the index structure and further improving the answer quality. Experimental results show that planners with CDRs exceed the basic LLM planner by 15.25% and 14.25% in grammatical accuracy (GA) and interpretation accuracy (IA) on average, respectively. Moreover, the entire workflow has been successfully integrated into various scenarios, demonstrating its practical applicability and robustness in the real world.

2025

CoAlign: Uncertainty Calibration of LLM for Geospatial Repartition
Zejun Xie | Zhiqing Hong | Wenjun Lyu | Haotian Wang | Guang Wang | Desheng Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

With the rapid expansion of e-commerce and continuous urban evolution, Geospatial Repartition, dividing geographical regions into delivery zones, is essential to optimize various objectives, e.g., on-time delivery rate, for last-mile delivery. Recently, large language models (LLMs) have offered promising capabilities for integrating diverse contextual information that is beneficial for geospatial repartition. However, given the inherent uncertainty in LLMs, adapting them to practical usage in real-world repartition is nontrivial. Thus, we introduce CoAlign, a novel three-stage framework that calibrates LLM uncertainty to enable robust geospatial repartition by transforming the task into a ranking problem, integrating historical data with LLM-generated candidates. It first generates explainable candidate partitions with a multi-criteria strategy and then designs a novel conformal method to rank these candidates relative to historical partitions with coverage guarantees. Finally, CoAlign delivers candidates through an interactive decision support system. Extensive evaluation with real-world data shows that CoAlign effectively calibrates LLM uncertainty and generates partitions that better align with human feedback. Moreover, we have deployed CoAlign in one of the world’s largest logistics companies, significantly enhancing their delivery operations by increasing candidate acceptance rates by 300% and improving on-time delivery rates by 3%. Our work provides a novel angle to address industrial geospatial decision-making tasks by calibrating LLM uncertainty.
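The abstract does not spell out CoAlign's conformal method, so the following is only a minimal sketch of textbook split conformal prediction, not the authors' procedure. It assumes each historical partition has a numeric nonconformity score (lower means more typical); the function names, the score scale, and the `alpha` miscoverage level are all hypothetical.

```python
import math

def conformal_threshold(calibration_scores, alpha=0.1):
    """Return the split-conformal cutoff for a target miscoverage alpha.

    calibration_scores: nonconformity scores of historical partitions
    (an assumed input; lower means more typical).
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: accept everything.
        return math.inf
    return sorted(calibration_scores)[k - 1]

def rank_and_filter(candidates, threshold):
    """Keep candidates scoring at or below the cutoff, best first."""
    kept = [(name, score) for name, score in candidates if score <= threshold]
    return sorted(kept, key=lambda item: item[1])

if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # hypothetical scores
    cutoff = conformal_threshold(history, alpha=0.5)
    print(rank_and_filter([("a", 7), ("b", 2), ("c", 5)], cutoff))
```

Under the usual exchangeability assumption, a new partition drawn like the historical ones falls at or below this cutoff with probability at least 1 - alpha, which is the kind of coverage guarantee the abstract refers to.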

Reinforcement Learning from Human Feedback (RLHF) has been shown to effectively align large language models (LLMs) with human knowledge. However, the lack of human preference labels remains a significant bottleneck when applying RLHF to a downstream domain. Humans in RLHF play a critical role in injecting reasoning preferences into LLMs, and we assume the reasoning process underlying human assessments may potentially be replaced by reasoning pathways derived from Knowledge Graphs (KGs). Inspired by this assumption, we propose Reinforcement Learning from Knowledge Graph Feedback (RLKGF), a novel method that leverages KG semantics and structure to derive RL rewards in the absence of manual annotations. Unlike Reinforcement Learning from AI Feedback (RLAIF), RLKGF directly integrates human priors encoded in KGs as the reward model, aligning LLM responses with expert knowledge without additional preference labeling or reward model training. RLKGF structures context-relevant facts into knowledge subgraphs and defines rewards by simulating information flow across semantic and logical connections between question and candidate response entities. Experiments on three public and one private medical dialogue dataset demonstrate that RLKGF significantly outperforms the competitive RLAIF in improving LLM diagnostic accuracy. The code is available at https://github.com/YanPioneer/RLKGF.

Agri-CM³: A Chinese Massive Multi-modal, Multi-level Benchmark for Agricultural Understanding and Reasoning
Haotian Wang | Yi Guan | Fanshu Meng | Chao Zhao | Lian Yan | Yang Yang | Jingchi Jiang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multi-modal Large Language Models (MLLMs) integrating images, text, and speech can provide farmers with accurate diagnoses and treatment of pests and diseases, enhancing agricultural efficiency and sustainability. However, existing benchmarks lack comprehensive evaluations, particularly in multi-level reasoning, making it challenging to identify model limitations. To address this issue, we introduce Agri-CM³, an expert-validated benchmark assessing MLLMs’ understanding and reasoning in agricultural management. It includes 3,939 images and 15,901 multi-level multiple-choice questions with detailed explanations. Evaluations of 45 MLLMs reveal significant gaps. Even GPT-4o achieves only 63.64% accuracy, falling short in fine-grained reasoning tasks. Analysis across three reasoning levels and seven compositional abilities highlights key challenges in accuracy and cognitive understanding. Our study provides insights for advancing MLLMs in agricultural management, driving their development and application. Code and data are available at https://github.com/HIT-Kwoo/Agri-CM3.

2024

Large language models (LLMs) have demonstrated strong reasoning capabilities. Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising approach. However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge. To address this, we propose Beam Aggregation Reasoning (BeamAggR), a reasoning framework for knowledge-intensive multi-hop QA. BeamAggR explores and prioritizes promising answers at each hop of the question. Concretely, we parse complex questions into trees, which include atomic and composite questions, followed by bottom-up reasoning. For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates. For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory. Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%. Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation.
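As a rough illustration of what probabilistic aggregation over multi-source answer candidates could look like, here is a minimal sketch. It is not the BeamAggR implementation: the function name, the per-source normalization, the summing rule, and `beam_size` are all assumptions made for this example.

```python
from collections import defaultdict

def aggregate_candidates(source_candidates, beam_size=2):
    """Merge answer candidates from several knowledge sources into one beam.

    source_candidates: list of lists of (answer, probability) pairs, one
    inner list per source (a hypothetical input format).
    """
    totals = defaultdict(float)
    for candidates in source_candidates:
        mass = sum(prob for _, prob in candidates)
        if mass <= 0:
            continue  # skip sources that carry no usable probability mass
        for answer, prob in candidates:
            # Normalize within the source so each source has equal weight.
            totals[answer] += prob / mass
    norm = sum(totals.values())
    # Highest aggregated score first; ties broken alphabetically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(answer, score / norm) for answer, score in ranked[:beam_size]]

if __name__ == "__main__":
    sources = [
        [("Paris", 0.6), ("Lyon", 0.4)],   # e.g. one retrieval source
        [("Paris", 0.9), ("Nice", 0.1)],   # e.g. another source
    ]
    print(aggregate_candidates(sources))
```

Answers supported by several sources accumulate score, so agreement across sources pushes a candidate up the beam; a composite question would then reason over the surviving candidates of its child questions.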

TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
Zheng Chu | Jingchang Chen | Qianglong Chen | Weijiang Yu | Haotian Wang | Ming Liu | Bing Qin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Grasping the concept of time is a fundamental facet of human cognition, indispensable for truly comprehending the intricacies of the world. Previous studies typically focus on specific aspects of time, lacking a comprehensive temporal reasoning benchmark. To address this, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. Besides, LLMs exhibit capability discrepancies across different reasoning categories. Furthermore, we thoroughly analyze the impact of multiple aspects on temporal reasoning and emphasize the associated challenges. We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning. Code and data are available at https://github.com/zchuz/TimeBench.

Reasoning, a fundamental cognitive process integral to human intelligence, has garnered substantial interest within artificial intelligence. Notably, recent studies have revealed that chain-of-thought prompting significantly enhances LLM’s reasoning capabilities, which attracts widespread attention from both academics and industry. In this paper, we systematically investigate relevant research, summarizing advanced methods through a meticulous taxonomy that offers novel perspectives. Moreover, we delve into the current frontiers and delineate the challenges and future directions, thereby shedding light on future research. Furthermore, we engage in a discussion about open questions. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zchuz/CoT-Reasoning-Survey

Retrieval-augmented generation integrates the capabilities of large language models with relevant information retrieved from an extensive corpus, yet encounters challenges when confronted with real-world noisy data. One recent solution is to train a filter module to find relevant content, but this achieves only suboptimal noise compression. In this paper, we propose to introduce the information bottleneck theory into retrieval-augmented generation. Our approach involves the filtration of noise by simultaneously maximizing the mutual information between compression and ground output, while minimizing the mutual information between compression and retrieved passage. In addition, we derive the formula of information bottleneck to facilitate its application in novel comprehensive evaluations, the selection of supervised fine-tuning data, and the construction of reinforcement learning rewards. Experimental results demonstrate that our approach achieves significant improvements across various question answering datasets, not only in terms of the correctness of answer generation but also in conciseness, with a 2.5% compression rate.
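The trade-off described in this abstract matches the general shape of the classical information bottleneck Lagrangian. The form below is the generic textbook objective, not the formula derived in the paper; the symbols are assumptions introduced here: X is the retrieved passage, Y the ground output, \tilde{X} the compressed passage, and \beta > 0 a trade-off weight.

```latex
\min_{p(\tilde{x} \mid x)} \; I(\tilde{X}; X) \;-\; \beta \, I(\tilde{X}; Y)
```

Minimizing the first term discards content of the retrieved passage, while the second term rewards keeping whatever is informative about the output; \beta sets the balance between the two.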