Zhijian Xu
2025
Guideline Compliance in Task-Oriented Dialogue: The Chained Prior Approach
Xiangyu Wen | Jianyuan Zhong | Zhijian Xu | Qiang Xu
Findings of the Association for Computational Linguistics: NAACL 2025
Task-oriented dialogue (TOD) systems are widely used across various domains, including customer service, appointment scheduling, and technical support. In real-world scenarios, such systems must adhere to given operational guidelines. However, existing solutions based on large language models often cannot achieve strict guideline compliance, even when fine-tuned with domain knowledge. To address this issue, we introduce a novel TOD system named GuidedTOD, which explicitly considers domain-specific guidelines by integrating a policy module. This module employs a Markov Chain, termed Chained Prior, to efficiently encode and dynamically update guideline knowledge. During inference, the Chained Prior re-ranks outputs from the domain-expert language model using beam search, ensuring guideline adherence. Experimental results show that GuidedTOD significantly improves guideline compliance, achieving approximately 20% better action prediction accuracy than state-of-the-art solutions. Code is available here: https://github.com/cure-lab/GuidedTOD.
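The core mechanism the abstract describes, re-ranking beam-search candidates with a Markov-chain prior over dialogue actions, can be sketched as follows. This is a minimal illustrative sketch, not GuidedTOD's actual implementation (see the linked repository); the names ChainedPrior and rerank and the interpolation weight alpha are assumptions.

```python
# Minimal sketch of Markov-chain re-ranking in the spirit of the Chained
# Prior described above. Illustrative only, not GuidedTOD's actual code;
# ChainedPrior, rerank, and `alpha` are assumed names.
import math
from collections import defaultdict

class ChainedPrior:
    """First-order Markov chain over dialogue actions, estimated from
    guideline-compliant action sequences."""

    def __init__(self, num_actions, smoothing=1e-3):
        self.num_actions = num_actions  # size of the action vocabulary
        self.smoothing = smoothing      # additive smoothing constant
        self.counts = defaultdict(lambda: defaultdict(float))

    def update(self, actions):
        """Dynamically update transition counts from one action sequence."""
        for prev, nxt in zip(actions, actions[1:]):
            self.counts[prev][nxt] += 1.0

    def log_prob(self, actions):
        """Smoothed log-probability of an action sequence under the chain."""
        lp = 0.0
        for prev, nxt in zip(actions, actions[1:]):
            row = self.counts.get(prev, {})
            total = sum(row.values()) + self.smoothing * self.num_actions
            lp += math.log((row.get(nxt, 0.0) + self.smoothing) / total)
        return lp

def rerank(beam, prior, alpha=0.5):
    """Re-rank beam candidates (action_sequence, lm_log_prob) by
    interpolating the LM score with the Chained Prior score."""
    score = lambda c: (1 - alpha) * c[1] + alpha * prior.log_prob(c[0])
    return sorted(beam, key=score, reverse=True)

# Usage: estimate the prior from guideline-compliant flows, then re-rank.
prior = ChainedPrior(num_actions=4)
prior.update(["greet", "verify_identity", "resolve_issue", "close"])
beam = [(["greet", "resolve_issue"], -1.2),    # higher LM score, off-policy
        (["greet", "verify_identity"], -1.5)]  # guideline-compliant
best = rerank(beam, prior)[0]  # the guideline-compliant candidate wins
```

The interpolation lets the prior veto fluent but non-compliant action sequences without retraining the underlying language model.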
2024
Revisiting Automated Evaluation for Long-form Table Question Answering
Yuqi Wang | Lyuhao Chen | Songcheng Cai | Zhijian Xu | Yilun Zhao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
In the era of data-driven decision-making, Long-Form Table Question Answering (LFTQA) is essential for integrating structured data with complex reasoning. Despite recent advancements in Large Language Models (LLMs) for LFTQA, evaluating their effectiveness remains a significant challenge. We introduce LFTQA-Eval, a meta-evaluation dataset comprising 2,988 human-annotated examples, to rigorously assess how well current automated metrics evaluate LLM-based LFTQA systems, with a focus on faithfulness and comprehensiveness. Our findings reveal that existing automatic metrics correlate poorly with human judgments and fail to consistently differentiate between factually accurate responses and those that are coherent but factually incorrect. Additionally, our in-depth examination of the limitations of automated evaluation methods provides essential insights for improving automated LFTQA evaluation.
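The meta-evaluation the abstract describes reduces to measuring how an automatic metric's scores track human judgments over the same outputs. Below is a minimal sketch; the field names human_score and metric_score are assumptions for illustration, not LFTQA-Eval's actual schema.

```python
# Minimal sketch of metric meta-evaluation: correlate an automatic metric's
# scores with human judgments. The field names `human_score` and
# `metric_score` are assumptions, not LFTQA-Eval's actual schema.
from scipy.stats import pearsonr, spearmanr

def metric_human_correlation(examples):
    """Return (Pearson, Spearman) correlation between metric scores and
    human ratings; low values mean the metric tracks humans poorly."""
    human = [ex["human_score"] for ex in examples]
    metric = [ex["metric_score"] for ex in examples]
    return pearsonr(human, metric)[0], spearmanr(human, metric)[0]

examples = [
    {"human_score": 5, "metric_score": 0.91},
    {"human_score": 2, "metric_score": 0.88},  # coherent but unfaithful
    {"human_score": 4, "metric_score": 0.90},
]
print(metric_human_correlation(examples))
```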
OpenT2T: An Open-Source Toolkit for Table-to-Text Generation
Haowei Zhang | Shengyun Si | Yilun Zhao | Lujing Xie | Zhijian Xu | Lyuhao Chen | Linyong Nan | Pengcheng Wang | Xiangru Tang | Arman Cohan
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Table data is pervasive in various industries, and its comprehension and manipulation demand significant time and effort from users seeking to extract relevant information. Consequently, an increasing number of studies have been directed towards table-to-text generation tasks. However, most existing methods are benchmarked solely on a limited number of datasets with varying configurations, leading to a lack of unified, standardized, fair, and comprehensive comparison between methods. This paper presents OpenT2T, the first open-source toolkit for table-to-text generation, designed to reproduce existing large language models (LLMs) for performance comparison and to expedite the development of new models. We have implemented and compared a wide range of LLMs under zero- and few-shot settings on 9 table-to-text generation datasets, covering data insight generation, table summarization, and free-form table question answering. Additionally, we maintain a public leaderboard to guide future work in choosing appropriate table-to-text generation systems for real-world scenarios.
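As a concrete illustration of the zero-shot setting the toolkit standardizes, the sketch below linearizes a table into a prompt for an LLM. This is a generic illustration under assumed helper names (linearize_table, build_prompt); it is not OpenT2T's actual interface.

```python
# Minimal sketch of zero-shot table-to-text prompting: flatten a table into
# text and wrap it with a task instruction. Illustrative only; this is not
# OpenT2T's actual interface, and the helper names are assumptions.
def linearize_table(header, rows):
    """Flatten a table into a pipe-separated string an LLM can read."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)

def build_prompt(header, rows, task="Summarize the table in one paragraph."):
    """Compose a zero-shot prompt from the instruction and linearized table."""
    return f"{task}\n\nTable:\n{linearize_table(header, rows)}\n\nAnswer:"

prompt = build_prompt(
    ["Team", "Wins", "Losses"],
    [["Falcons", 11, 5], ["Saints", 7, 9]],
)
# `prompt` would then be sent to any LLM backend; a toolkit like the one
# described compares many such backends under a shared prompt format.
```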