Jiangning Chen

2025

Hallucination Detection in Structured Query Generation via LLM Self-Debating
Miaoran Li | Jiangning Chen | Minghua Xu | Xiaolong Wang
Findings of the Association for Computational Linguistics: EMNLP 2025

Hallucination remains a key challenge in applying large language models (LLMs) to structured query generation, especially for semi-private or domain-specific languages underrepresented in public training data. In this work, we focus on hallucination detection in these low-resource structured language scenarios, using Splunk Search Processing Language (SPL) as a representative case study. We start by analyzing real-world SPL generation to define hallucination in this context and introduce a comprehensive taxonomy. To enhance detection performance, we propose the Self-Debating framework, which prompts an LLM to generate contrastive explanations from opposing perspectives before rendering a final consistency judgment. We also construct a synthetic benchmark, SynSPL, to support systematic evaluation of hallucination detection in SPL generation. Experimental results show that Self-Debating consistently outperforms LLM-as-a-Judge baselines with zero-shot and chain-of-thought (CoT) prompts in SPL hallucination detection across different LLMs, yielding 5–10% relative gains in hallucination F1 scores on both real and synthetic datasets, and up to 260% improvement for LLaMA-3.1-8B. Beyond hallucination detection on SPL, Self-Debating also achieves excellent performance on the FaithBench benchmark for summarization hallucination, demonstrating the strong generalization ability of Self-Debating, with OpenAI o1-mini achieving state-of-the-art performance. All these results consistently demonstrate the strong robustness and wide generalizability of Self-Debating.

2024

Optimizing Entity Resolution in Voice Interfaces: An ASR-Aware Entity Reference Expansion Approach
Jiangning Chen | Ziyun Zhang | Qianli Hu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

This paper tackles the challenges presented by Automatic Speech Recognition (ASR) errors in voice-based dialog systems, specifically their adverse impact on Entity Resolution (ER) as a downstream task. Balancing accuracy against the speed requirement of online retrieval proves challenging, particularly when limited data links the failed mentions to resolved entities. In this paper, we propose an entity reference expansion system, injecting pairs of failed mentions and resolved entity names into the knowledge graph, enhancing its awareness of unresolved mentions. To address data scarcity, we introduce a synthetic data generation approach aligned with noise patterns. This, combined with an ASR-Error-Aware Loss function, facilitates the training of a RoBERTa model, which filters failed mentions and extracts entity pairs for knowledge graph expansion. These designs confront obstacles related to ASR noise, data limitations, and online entity retrieval.

2021

The growing popularity of Virtual Assistants poses new challenges for Entity Resolution, the task of linking mentions in text to their referent entities in a knowledge base. Specifically, in the shopping domain, customers tend to mention the entities implicitly (e.g., “organic milk”) rather than use the entity names explicitly, leading to a large number of candidate products. Meanwhile, for the same query, different customers may expect different results. For example, with “add milk to my cart”, one customer may refer to a certain product from his/her favorite brand, while other customers may want to re-order products they regularly purchase. Moreover, new customers may lack persistent shopping history, which requires us to enrich the connections between customers through products and their attributes. To address these issues, we propose a new framework that leverages personalized features to improve the accuracy of product ranking. We first build a cross-source heterogeneous knowledge graph from customer purchase history and a product knowledge graph to jointly learn customer and product embeddings. After that, we incorporate product, customer, and history representations into a neural reranking model to predict which candidate is most likely to be purchased by a specific customer. Experimental results show that our model substantially improves the accuracy of the top-ranked candidates by 24.6% compared to the state-of-the-art product search model.

In dialog systems, the Natural Language Understanding (NLU) component typically makes the interpretation decision (including domain, intent, and slots) for an utterance before the mentioned entities are resolved. This may result in intent classification and slot tagging errors. In this work, we propose to leverage Entity Resolution (ER) features in NLU reranking and introduce a novel loss term based on ER signals to better learn model weights in the reranking framework. In addition, for a multi-domain dialog scenario, we propose a score distribution matching method to ensure that scores generated by the NLU reranking models for different domains are properly calibrated. In offline experiments, we demonstrate that our proposed approach significantly outperforms the baseline model on both single-domain and cross-domain evaluations.

In recent years, incorporating external knowledge for response generation in open-domain conversation systems has attracted great interest. To improve the relevancy of retrieved knowledge, we propose a neural entity linking (NEL) approach. Unlike formal documents, such as news, conversational utterances are informal and multi-turn, which makes it more challenging to disambiguate the entities. Therefore, we present a context-aware named entity recognition (NER) model and an entity resolution (ER) model to utilize dialogue context information. We conduct NEL experiments on three open-domain conversation datasets and validate that incorporating context information improves the performance of NER and ER models. The end-to-end NEL approach outperforms the baseline by a relative 62.8% in F1. Furthermore, we verify that using external knowledge based on NEL benefits the neural response generation model.