Jiaqi Li
2026
Automatic and Reliable Evaluation for Academic Caption-to-Figure Generation with LMMs
Guanghui Ye | Huan Zhao | Qin Zhu | Fengnan Li | Jiaqi Li | Yixian Shen | Zhonghao Ren | Zhihua Jiang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Existing datasets for evaluating text-to-image generation focus mostly on real-life images, which makes it difficult to assess academic figure generation from real scientific captions, a hot topic in AI for Science. To fill this gap, we propose HE4AFG, a novel dataset that is the first to provide a Holistic Evaluation for Academic caption-to-Figure Generation (AFG). Specifically, HE4AFG collects real figure captions from 8 scientific domains and generates 3,900 evaluation samples, including multi-panel figures, using 5 mainstream large multimodal models (LMMs). For each sample, we provide high-quality human ratings on three aspects: scientific aesthetic (SA), topic relevance (TR), and attribute correctness (AC). Moreover, we present two trainable models: (1) HE4AFG-E, an automated Evaluation model for AFG, which generates aspect-aware training examples and then uses them to train three aspect-specific evaluation modules via contrastive learning; (2) HE4AFG-R, an automated Refinement model, which generates and utilizes feedback on the quality of the figures (e.g., unfaithful elements) to continuously improve AFG. Extensive experiments on HE4AFG demonstrate the effectiveness and performance advantages of our models.
Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization
Jiaqi Li | Guangming Wang | Shuntian Zheng | Minzhe Ni | Xiaoman Lu | Guanghui Ye | Yu Guan
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights the language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms the state of the art by up to 3.2% mAP. Our code is available at https://github.com/JiaqiLi404/ActionVLM
2025
Know the Unknown: An Uncertainty-Sensitive Method for LLM Instruction Tuning
Jiaqi Li | Yixuan Tang | Yi Yang
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) demonstrate remarkable capabilities but face challenges from hallucinations, which typically arise from insufficient knowledge or context. While instructing LLMs to acknowledge knowledge limitations by responding with “I don’t know” appears promising, we find that models consistently struggle with admitting knowledge gaps. This challenge may originate from current instruction datasets that emphasise answer generation over knowledge boundary awareness. To address this limitation, we introduce **U**ncertainty-and-**S**ensitivity-Aware Tuning **(US-Tuning)**, a novel two-stage approach for contextual question answering (QA). The first stage enhances LLMs’ ability to recognise their knowledge boundaries, while the second stage reinforces instruction adherence through carefully designed causal prompts. Our experimental results demonstrate that US-Tuning not only significantly reduces incorrect answers in contextual QA but also improves models’ faithfulness to their parametric knowledge, mitigating hallucinations in general QA tasks. Our fine-tuned Llama2-7B model achieves up to a 34.7% improvement in handling out-of-knowledge questions and outperforms GPT-4 by 4.2% in overall performance.