2025
Visual Evidence Prompting Mitigates Hallucinations in Large Vision-Language Models
Wei Li | Zhen Huang | Houqiang Li | Le Lu | Yang Lu | Xinmei Tian | Xu Shen | Jieping Ye
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Vision-Language Models (LVLMs) have shown impressive progress by integrating visual perception with linguistic understanding to produce contextually grounded outputs. Despite these advancements, LVLMs still suffer from the hallucination problem, e.g., they tend to produce content that does not exist in the input images. Our investigation suggests that such hallucinations often stem from deficiencies in fine-grained visual comprehension, particularly when visual scenes exhibit appearance or semantic similarities (e.g., bicycle vs. motorcycle, baseball bat vs. baseball). In this work, we show that such hallucinations are naturally mitigated via a novel method called visual evidence prompting, which utilizes small visual models to complement LVLMs. While traditional visual models are not adept at interacting with humans, they excel at perceiving fine-grained image content. By symbolizing the professional outputs of domain-expert models as prompts, LVLM generalists can refer to this evidence as visual knowledge and generate more precise answers. Detailed analysis shows that visual evidence enables models to adjust and rectify their attribution and attention over the images, reducing visual confusion by suppressing false activations while enhancing correct ones. Extensive experiments and in-depth analysis demonstrate the effectiveness of our method. We hope this straightforward but insightful work enhances the comprehension of hallucination in LVLMs and offers valuable perspectives on addressing such challenges.
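As a rough illustration of the prompting scheme the abstract describes, the sketch below (in Python) serializes the structured output of a small domain-expert vision model, here an assumed object detector, into a textual evidence block that is prepended to the user question before it is passed to the LVLM. The detection format, labels, and prompt wording are illustrative assumptions, not the paper's released implementation.

def serialize_evidence(detections: list[dict]) -> str:
    """Turn hypothetical detector outputs into a textual evidence block."""
    lines = [f"- {d['label']} (confidence {d['score']:.2f})" for d in detections]
    return "Visual evidence from an object detector:\n" + "\n".join(lines)

def visual_evidence_prompt(question: str, detections: list[dict]) -> str:
    """Compose the final prompt: evidence first, then the user question."""
    return (f"{serialize_evidence(detections)}\n\n"
            f"Using the evidence above and the image, answer: {question}")

if __name__ == "__main__":
    # Hypothetical detector output for an image of a cyclist.
    dets = [{"label": "bicycle", "score": 0.93},
            {"label": "person", "score": 0.97}]
    print(visual_evidence_prompt("Is there a motorcycle in the image?", dets))

The intuition, as the abstract frames it, is that the detector's fine-grained labels ("bicycle", not "motorcycle") give the generalist model explicit evidence to attend to, counteracting confusion between visually similar categories.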
YuLan-Mini: Pushing the Limits of Open Data-efficient Language Model
Hu Yiwen | Huatong Song | Jie Chen | Jia Deng | Jiapeng Wang | Kun Zhou | Yutao Zhu | Jinhao Jiang | Zican Dong | Yang Lu | Xu Miao | Xin Zhao | Ji-Rong Wen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Due to the immense resource demands and the complex techniques involved, it remains challenging to successfully pre-train a large language model (LLM) with state-of-the-art performance. In this paper, we explore the key bottlenecks and design choices in pre-training, and make the following contributions: (1) a comprehensive investigation into the factors contributing to training instability; (2) a robust optimization approach designed to mitigate training instability effectively; (3) an elaborate data pipeline that integrates data synthesis, data curriculum, and data selection. By combining these techniques, we create a rather low-cost training recipe and use it to pre-train YuLan-Mini, a fully open base model with 2.4B parameters trained on 1.08T tokens. Remarkably, YuLan-Mini achieves top-tier performance among models of similar parameter scale, comparable to industry-leading models that require significantly more data. To facilitate reproduction, we release the full details of the training recipe and data composition. Project details can be accessed at the following link: https://anonymous.4open.science/r/YuLan-Mini/README.md.
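The abstract names a data curriculum as one of the three pipeline components; the sketch below shows, in schematic form, what stage-dependent mixture sampling can look like. The data sources, stage boundaries, and weights are invented for illustration and are not YuLan-Mini's actual schedule.

import random

# Hypothetical mixture weights per curriculum stage: (progress threshold,
# {source: sampling probability}). Purely illustrative values.
CURRICULUM = [
    (0.0, {"web": 0.8, "code": 0.15, "synthetic": 0.05}),   # early training
    (0.6, {"web": 0.5, "code": 0.25, "synthetic": 0.25}),   # mid training
    (0.9, {"web": 0.3, "code": 0.3, "synthetic": 0.4}),     # annealing phase
]

def sample_source(progress: float, rng: random.Random) -> str:
    """Pick a data source for the next batch given training progress in [0, 1]."""
    weights = max((stage for stage in CURRICULUM if progress >= stage[0]),
                  key=lambda stage: stage[0])[1]
    return rng.choices(list(weights), weights=list(weights.values()))[0]

if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.1, 0.7, 0.95):
        picks = [sample_source(p, rng) for _ in range(1000)]
        print(p, {s: picks.count(s) for s in ("web", "code", "synthetic")})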
Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses
Subin An | Yugyeong Ji | Junyoung Kim | Heejin Kook | Yang Lu | Josh Seltzer
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses, which have distinct characteristics. To address these characteristics, we propose a two-stage evaluation framework designed specifically for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions (effort, relevance, and completeness) are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates strong practical applicability in real-world settings such as response quality prediction and response rejection, showing strong correlations with expert assessments.
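To make the two-stage design concrete, here is a minimal sketch: a cheap heuristic gate for gibberish, followed by an LLM rubric over the three named dimensions. The heuristic, the prompt wording, and the injectable llm callable are illustrative assumptions, not the paper's implementation.

from typing import Callable

def looks_like_gibberish(text: str) -> bool:
    """Stage 1: flag responses that are too short or contain almost no
    word-like tokens (an assumed proxy, not the paper's actual filter)."""
    tokens = text.split()
    if len(tokens) < 2:
        return True
    alpha_ratio = sum(t.isalpha() for t in tokens) / len(tokens)
    return alpha_ratio < 0.5

RUBRIC_PROMPT = (
    "Rate the survey response on a 1-5 scale for each dimension.\n"
    "Question: {question}\nResponse: {response}\n"
    "Dimensions: effort, relevance, completeness.\n"
    "Answer as three integers separated by spaces."
)

def evaluate_response(question: str, response: str,
                      llm: Callable[[str], str]) -> dict:
    """Two-stage evaluation: gibberish gate, then LLM-based scoring."""
    if looks_like_gibberish(response):
        return {"gibberish": True}
    raw = llm(RUBRIC_PROMPT.format(question=question, response=response))
    effort, relevance, completeness = (int(x) for x in raw.split()[:3])
    return {"gibberish": False, "effort": effort,
            "relevance": relevance, "completeness": completeness}

if __name__ == "__main__":
    # Stub LLM so the sketch runs standalone; swap in a real API call.
    demo = evaluate_response("What did you like about the product?",
                             "The battery life was great but setup was slow.",
                             llm=lambda prompt: "4 5 4")
    print(demo)

Separating the gate from the rubric mirrors the framework's stated motivation: nonsensical responses are rejected cheaply before any LLM call, and only plausibly meaningful text incurs the cost of dimensional scoring.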