Zizhen Wang
2026
Putting Captions to the Test: Evaluating Video Caption Quality through Multiple-Choice Question Answering
Zizhen Wang | Bo Feng | Zhengfeng Lai | Shiyu Li | Yang Lu | Meng Cao | Ping Huang | Xiaoming Simon Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating video captioning remains a critical challenge for Visual Large Language Models (VLLMs). Existing metrics primarily rely on matching generated text against ground-truth references. This paradigm suffers from the “one-to-many” nature of video description, where high-quality captions are often penalized for lexical mismatches or valid shifts in visual focus. Furthermore, such assessments are typically one-dimensional, failing to provide a fine-grained analysis of caption quality. To address this, we redefine caption quality through the lens of information fidelity: A caption must maximize the coverage of salient visual information while ensuring strict factuality. We introduce CapQuiz, a novel reference-free benchmark that assesses captions based on their utility in answering human-verified, fine-grained, multiple-choice questions derived from the video. CapQuiz features a hierarchical taxonomy of 10 question types (spanning Descriptive and Inferential categories) across 24 diverse video domains. Extensive experiments demonstrate that CapQuiz correlates significantly better with human judgments than existing metrics and offers interpretable insights into model performance. We will release the benchmark to facilitate reproducible research.
2022
The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking
Yinghui Li | Qingyu Zhou | Yangning Li | Zhongli Li | Ruiyang Liu | Rongyi Sun | Zizhen Wang | Chao Li | Yunbo Cao | Hai-Tao Zheng
Findings of the Association for Computational Linguistics: ACL 2022
Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling errors, which are mainly caused by phonological or visual similarity. Recently, pre-trained language models (PLMs) have advanced the CSC task. However, there is a gap between the knowledge learned by PLMs and the goal of the CSC task. PLMs focus on the semantics of text and tend to correct erroneous characters to semantically proper or commonly used ones, but these are not the ground-truth corrections. To address this issue, we propose an Error-driven COntrastive Probability Optimization (ECOPO) framework for the CSC task. ECOPO refines the knowledge representations of PLMs and guides the model to avoid predicting these common characters in an error-driven way. In particular, ECOPO is model-agnostic and can be combined with existing CSC methods to achieve better performance. Extensive experiments and detailed analyses on SIGHAN datasets demonstrate that ECOPO is simple yet effective.
DialogUSR: Complex Dialogue Utterance Splitting and Reformulation for Multiple Intent Detection
Haoran Meng | Xin Zheng | Tianyu Liu | Zizhen Wang | He Feng | Binghuai Lin | Xuemin Zhao | Yunbo Cao | Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2022
While interacting with chatbots, users may express multiple intents in a single dialogue utterance. Instead of training a dedicated multi-intent detection model, we propose DialogUSR, a dialogue utterance splitting and reformulation task that first splits a multi-intent user query into several single-intent sub-queries and then recovers all the coreferred and omitted information in the sub-queries. DialogUSR can serve as a plug-in, domain-agnostic module that enables multi-intent detection for deployed chatbots with minimal effort. We collect a high-quality, naturally occurring dataset that covers 23 domains with a multi-step crowd-sourcing procedure. To benchmark the proposed dataset, we propose multiple action-based generative models that involve end-to-end and two-stage training, and conduct in-depth analyses of the pros and cons of the proposed baselines.