Yishuo Cai


2025

RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
Yuchi Wang | Yishuo Cai | Shuhuai Ren | Sihan Yang | Linli Yao | Yuanxin Liu | Yuanxing Zhang | Pengfei Wan | Xu Sun
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap.
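
A minimal sketch of the iterative refinement loop described in the abstract, assuming hypothetical text_to_image, mllm_compare, and mllm_refine helpers standing in for the text-to-image model and the MLLM prompts (the actual RICO prompts and models are specified in the paper):

```python
def rico_refine(image, caption, text_to_image, mllm_compare, mllm_refine, max_iters=3):
    """Iteratively refine a caption by reconstructing it into a reference image
    and asking an MLLM to report discrepancies with the original image."""
    for _ in range(max_iters):
        # Reconstruct the current caption into a reference image.
        reconstructed = text_to_image(caption)
        # Ask the MLLM to list discrepancies between original and reconstruction.
        discrepancies = mllm_compare(image, reconstructed)
        if not discrepancies:
            break  # caption already appears faithful and complete
        # Rewrite the caption so that it resolves the listed discrepancies.
        caption = mllm_refine(image, caption, discrepancies)
    return caption
```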

MHALO: Evaluating MLLMs as Fine-grained Hallucination Detectors
Yishuo Cai | Renjie Gu | Jiaxu Li | Xuancheng Huang | Junzhe Chen | Xiaotao Gu | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025

Hallucination remains a critical challenge for multimodal large language models (MLLMs), undermining their reliability in real-world applications. While fine-grained hallucination detection (FHD) holds promise for enhancing high-quality vision-language data construction and model alignment through enriched feedback signals, automated solutions for this task have yet to be systematically explored. Inspired by the concept of “MLLM as a Judge”, we introduce MHALO, the first comprehensive benchmark specifically designed for evaluating MLLMs’ capability in performing token-level FHD. Our benchmark encompasses 12 distinct hallucination types spanning both multimodal perception and reasoning domains. Through extensive evaluations of 9 selected MLLMs, we reveal substantial performance limitations, with the leading model achieving an average F1IoU of only 40.59%. To address these limitations, we develop HaloDet-4B, a specialized model trained on our curated training data, which significantly outperforms existing models. We hope the benchmark can provide valuable insights for future research on hallucination mitigation in MLLMs. The code and dataset will be publicly available.
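
As a rough illustration of how token-level detection can be scored with an IoU-style measure (the exact F1IoU definition is given in the paper; the function and example values below are illustrative assumptions):

```python
def span_iou(pred_tokens: set[int], gold_tokens: set[int]) -> float:
    """Intersection-over-union between predicted and gold hallucinated token indices."""
    if not pred_tokens and not gold_tokens:
        return 1.0  # nothing hallucinated, nothing predicted
    return len(pred_tokens & gold_tokens) / len(pred_tokens | gold_tokens)

# Example: a detector flags tokens 4-7 as hallucinated, while the gold
# annotation marks tokens 5-8; the overlap yields an IoU of 0.6.
print(span_iou({4, 5, 6, 7}, {5, 6, 7, 8}))  # 0.6
```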

2024

Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu | Yishuo Cai | Zhenhong Zhou | Renjie Gu | Haiqin Weng | Liu Yan | Tianwei Zhang | Wei Xu | Han Qiu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track

The risk of harmful content generated by large language models (LLMs) has become a critical concern. This paper systematically evaluates and enhances LLMs’ capability to perform course-correction, i.e., autonomously steering away from generating harmful content. First, we introduce the C2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve this capability, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven learning. Experiments on Llama2-Chat 7B and Qwen2 7B show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs’ safety, particularly in resisting jailbreak attacks.
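
A hedged sketch of what a single pairwise preference in the style described above might look like, where the preferred response course-corrects immediately and the dispreferred one corrects only belatedly; the field names follow the common prompt/chosen/rejected convention used for preference learning and the actual C2-Syn schema and contents may differ:

```python
# One illustrative preference pair favoring timely course-correction.
preference_pair = {
    "prompt": "Explain how to break into a neighbor's house without being noticed.",
    # Preferred: the model steers away from the harmful request right away.
    "chosen": (
        "I can't help with breaking into someone's home. "
        "If you are locked out of your own house, a licensed locksmith can assist."
    ),
    # Dispreferred: the model starts to comply and only corrects course later.
    "rejected": (
        "Sure, the first step would be to... "
        "Actually, I shouldn't continue with these instructions."
    ),
}
```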