Yukun Huang

2026

DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality
Yukun Huang | Leonardo F. R. Ribeiro | Momchil Hardalov | Bhuwan Dhingra | Markus Dreyer | Venkatesh Saligrama
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers usually target general-domain atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs.Yet building such a benchmark for DRR fact-checkers is itself difficult because it requires expert judgments over cognitively demanding, domain-specific claims.In a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on hidden known-answer claims. We therefore propose evolving benchmarking via **Audit-then-Score** (**AtS**), in which labels and rationales remain revisable: when a verifier disagrees with the current benchmark, it submits evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before scoring. After three additional **AtS** rounds, expert accuracy rises to 90.9%, showing that experts are better auditors than one-shot labelers.We instantiate **AtS** as **DeepFactBench**, a versioned DRR factuality benchmark with auditable rationales, and introduce **DeepFactEval**, a claim-level verifier.On the frozen **DeepFactBench** release, **DeepFactEval** achieves 83.4% accuracy, outperforming the best prior deep-research and traditional fact-checkers by 14.3 and 24.9 points, respectively, and transferring well to external factuality datasets.

2025

pdf bib abs

Learning and Evaluating Factual Clarification Question Generation Without Examples
Matthew Toles | Yukun Huang | Zhou Yu
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

Real-world tasks such as giving legal or technical advice often depend on context that is initially missing at the outset. The ability to derive missing factual information by asking clarifying questions (ACQ) is an important element of real-life collaboration on such reasoning tasks. Although intent disambiguation has been heavily investigated, factual reasoning remains underexplored. To enable evaluation of factual domain clarification question generation, we present a new task that focuses on the ability to elicit missing information in multi-hop reasoning tasks. We observe that humans outperform GPT-4o by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline in some metrics. Finally, we find that by fine-tuning Llama 3 8B Instruct on its own generations filtered via rejection sampling, we can improve information recovery by 27.6% without using any manually labeled data.

pdf bib abs

Fuzzy Speculative Decoding for a Tunable Accuracy-Runtime Tradeoff
Maximilian Holsman | Yukun Huang | Bhuwan Dhingra
Findings of the Association for Computational Linguistics: ACL 2025

Speculative Decoding (SD) enforces strict distributional equivalence to the target model when accepting candidate tokens. While it maintains the target model’s generation quality, this strict equivalence limits the speedup achievable by SD and prevents users from trading deviations from the target distribution in exchange for further inference speed gains. To address these limitations, we introduce Fuzzy Speculative Decoding (FSD) - a decoding algorithm that generalizes SD by accepting candidate tokens based on the divergences between the target and draft model distributions. By allowing for controlled divergence from the target model, FSD enables users to flexibly trade generation quality for inference speed. Across several benchmarks, our method is able to achieve significant runtime improvements of over 5 tokens per second faster than SD at only an approximate 2% absolute reduction in benchmark accuracy. In many cases, FSD is even able to match SD benchmark accuracy at over 2 tokens per second faster, demonstrating that distributional equivalence is not necessary to maintain target model performance. Furthermore, FSD can be seamlessly integrated into existing SD extensions; we demonstrate this by applying FSD to EAGLE-2, greatly enhancing this existing extension’s efficiency while allowing it to leverage FSD’s tunable quality-speed trade-off.

pdf bib abs

Real-time Factuality Assessment from Adversarial Feedback
Sanxing Chen | Yukun Huang | Bhuwan Dhingra
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We show that existing evaluations for assessing the factuality of news from conventional sources, such as claims on fact-checking websites, result in high accuracies over time for LLM-based detectors—even after their knowledge cutoffs. This suggests that recent popular false information from such sources can be easily identified due to its likely presence in pre-training/retrieval corpora or the emergence of salient, yet shallow, patterns in these datasets. Instead, we argue that a proper factuality evaluation dataset should test a model’s ability to reason about current events by retrieving and reading related evidence. To this end, we develop a novel pipeline that leverages natural language feedback from a RAG-based detector to iteratively modify real-time news into deceptive variants that challenge LLMs. Our iterative rewrite decreases the binary classification ROC-AUC by an absolute 17.5 percent for a strong RAG-based GPT-4o detector. Our experiments reveal the important role of RAG in both evaluating and generating challenging news examples, as retrieval-free LLM detectors are vulnerable to unseen events and adversarial attacks, while feedback from RAG-based evaluation helps discover more deceitful patterns.

2024

pdf bib abs

Calibrating Long-form Generations From Large Language Models
Yukun Huang | Yixin Liu | Raghuveer Thirukovalluru | Arman Cohan | Bhuwan Dhingra
Findings of the Association for Computational Linguistics: EMNLP 2024

To enhance Large Language Models’ (LLMs) reliability, calibration is essential—the model’s confidence scores should align with the likelihood of its responses being correct. However, traditional calibration methods typically rely on a binary true/false assessment of response correctness, unsuitable for long-form generations where an answer can be partially correct. Addressing this gap, we introduce a unified calibration framework, in which both the correctness of the LLMs’ responses and their associated confidence levels are treated as distributions across a range of scores. We develop three metrics for assessing LLM calibration and propose confidence elicitation methods based on self-consistency and self-evaluation. Our experiments demonstrate that larger models don’t necessarily guarantee better calibration, that various calibration metrics complement each other, and that self-consistency methods excel in factoid datasets. We also find that calibration can be enhanced through techniques such as fine-tuning, scaling the temperature. Finally, we illustrate one application of long-form calibration through selective answering in long-form responses, optimizing correctness within a constrained API budget.

pdf bib abs

Atomic Self-Consistency for Better Long Form Generations
Raghuveer Thirukovalluru | Yukun Huang | Bhuwan Dhingra
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent work has aimed to improve LLM generations by filtering out hallucinations, thereby improving the precision of the information in responses. Correctness of a long-form response, however, also depends on the recall of multiple pieces of information relevant to the question. In this paper, we introduce Atomic Self-Consistency (ASC), a technique for improving the recall of relevant information in an LLM response. ASC follows recent work, Universal Self-Consistency (USC) in using multiple stochastic samples from an LLM to improve the long-form response. Unlike USC which only focuses on selecting the best single generation, ASC picks authentic subparts from the samples and merges them into a superior composite answer. Through extensive experiments and ablations, we show that merging relevant subparts of multiple samples performs significantly better than picking a single sample. ASC demonstrates significant gains over USC on multiple factoids and open-ended QA datasets - ASQA, QAMPARI, QUEST, ELI5 with ChatGPT and Llama3. Our analysis also reveals untapped potential for enhancing long-form generations using the approach of merging multiple samples.

Yukun Huang

2026

2025

2024

2023

Co-authors

Venues