Yueqin Yin

2026

Large language models often hallucinate, producing content that is factually incorrect or not grounded in the provided sources. Reliable faithfulness verification is critical for trustworthy deployment. In the provided-source (closed-world) setting, existing verifiers either classify whole passages in one step or check sentences independently, overlooking cross-sentence context. We present ContextCheck, a framework for sentence-level faithfulness verification with context-aware disambiguation. Each sentence is verified against the grounding document while conditioning on preceding sentences, enabling pronouns and references to be resolved directly in context. This design casts verification as a context-conditioned task and avoids the separate decontextualization step of rewriting claims into self-contained forms. Fine-tuned from Llama-3.1-8B-Instruct, ContextCheck sets a new state of the art on three context-dependent datasets, improving Macro F1 by over 10 points over the strongest baselines. On 14 standard single-sentence datasets, it matches or slightly surpasses prior 8B-scale verifiers (average Macro F1 73.5 vs. 72.8). These results show that ContextCheck offers a practical and effective approach for sentence-level hallucination detection.
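The context-conditioned setup described in the abstract can be illustrated with a minimal sketch. It only shows how each sentence might be paired with the grounding document and its preceding sentences before being passed to a verifier; the names `VerificationInput` and `build_inputs` are hypothetical, and the actual ContextCheck input format and model call are not specified in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VerificationInput:
    """One verification example (hypothetical structure, for illustration)."""
    document: str   # grounding document the sentence is checked against
    context: str    # preceding sentences, kept so references can be resolved
    sentence: str   # the sentence whose faithfulness is being judged


def build_inputs(document: str, sentences: List[str]) -> List[VerificationInput]:
    """Pair every sentence with the document and all sentences before it."""
    inputs = []
    for i, sentence in enumerate(sentences):
        # Assumption: the context is simply the earlier sentences joined by spaces.
        context = " ".join(sentences[:i])
        inputs.append(VerificationInput(document, context, sentence))
    return inputs


response = ["Marie Curie won two Nobel Prizes.", "She was born in Warsaw."]
examples = build_inputs("(grounding document text)", response)
```

In this sketch the second example carries the first sentence as context, so a verifier could resolve "She" without a separate rewriting step.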

2025


KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Zhangchen Xu | Yang Liu | Yueqin Yin | Mingyuan Zhou | Radha Poovendran
Findings of the Association for Computational Linguistics: ACL 2025

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question–solution–test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases, with additional attempts allocated to challenging problems. Finally, post-training data is synthesized by rewriting questions into diverse formats and generating responses from a reasoning model (DeepSeek R1) under a test-based rejection sampling procedure. This pipeline yields a large-scale, robust, and diverse coding dataset. It is suitable for supervised fine-tuning, and the paired unit tests also offer strong potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
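As a rough illustration of test-based rejection sampling, the sketch below keeps only candidate solutions that pass their paired unit tests. It is not the KodCode implementation: the helper names `passes_tests` and `rejection_sample` are hypothetical, and running code with `exec` in the current process is an assumption made for brevity (a real pipeline would use a sandbox with timeouts).

```python
from typing import List


def passes_tests(solution_src: str, test_src: str) -> bool:
    """Return True if the solution defines what the tests need and all asserts hold."""
    namespace: dict = {}
    try:
        exec(solution_src, namespace)  # define the candidate's functions
        exec(test_src, namespace)      # run assert-based tests against them
    except Exception:
        # Any failure (syntax error, runtime error, failed assert) rejects the candidate.
        return False
    return True


def rejection_sample(candidates: List[str], test_src: str) -> List[str]:
    """Keep only the candidates that pass the paired tests."""
    return [c for c in candidates if passes_tests(c, test_src)]


candidates = [
    "def add(a, b):\n    return a - b",  # wrong: rejected by the test
    "def add(a, b):\n    return a + b",  # correct: accepted
]
tests = "assert add(2, 3) == 5"
accepted = rejection_sample(candidates, tests)
```

Only the second candidate survives here, which mirrors the idea that unit tests act as the acceptance filter for generated responses.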

Co-authors

Xin Liu 1

Venues

Findings 2
