Jiaxin Ding
2026
VisPCO: Visual Token Pruning Configuration Optimization via Budget-Aware Pareto-Frontier Learning for Vision-Language Models
Huawei Ji | Yuanhao Sun | Yuan Jin | Cheng Deng | Jiaxin Ding | Luoyi Fu | Xinbing Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Visual token pruning methods effectively mitigate the quadratic computational growth caused by processing high-resolution images and video frames in vision-language models (VLMs). However, existing approaches rely on predefined pruning configurations without determining whether they achieve computation-performance optimality. In this work, we introduce VisPCO, a novel framework that formulates visual token pruning as a Pareto configuration optimization problem to automatically identify optimal configurations. Our approach employs continuous relaxation and straight-through estimators to enable gradient-based search, solved via the Augmented Lagrangian method. Extensive experiments across 8 visual benchmarks demonstrate that VisPCO effectively approximates the empirical Pareto frontier obtained through grid search and generalizes well across various pruning methods and VLM architectures. Furthermore, through learnable kernel functions, we investigate layer-wise pruning patterns and reveal that multi-step progressive pruning captures VLMs’ hierarchical compression structure, achieving superior accuracy-efficiency trade-offs compared to single-layer approaches.
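To illustrate the kind of budget-constrained search the abstract mentions, the following is a minimal sketch of the Augmented Lagrangian method on a toy one-dimensional problem. It is not the paper's implementation: the objective `(x - 2)^2`, the budget `x <= 1`, the penalty weight `rho`, and all function names are assumptions chosen only so the result can be checked by hand.

```python
def augmented_lagrangian_toy(rho=10.0, outer_iters=20, inner_iters=200, lr=0.05):
    """Minimize f(x) = (x - 2)^2 subject to g(x) = x - 1 <= 0.

    Toy stand-in (assumption): x plays the role of a relaxed pruning
    configuration, f a task loss, and g a compute-budget violation.
    """
    x, lam = 0.0, 0.0
    for _ in range(outer_iters):
        # Inner loop: gradient descent on the augmented Lagrangian in x.
        for _ in range(inner_iters):
            g = x - 1.0
            # For an inequality constraint the effective multiplier is
            # clipped at zero, so an inactive constraint adds no gradient.
            active = max(0.0, lam + rho * g)
            grad = 2.0 * (x - 2.0) + active * 1.0  # dg/dx = 1
            x -= lr * grad
        # Outer step: multiplier update, kept non-negative.
        lam = max(0.0, lam + rho * (x - 1.0))
    return x, lam


x_opt, lam_opt = augmented_lagrangian_toy()
```

In this toy problem the constrained optimum sits on the budget boundary, so `x_opt` approaches 1 and `lam_opt` approaches 2, the multiplier that balances the loss gradient against the constraint.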
GR1: Reinforcement-Enhanced LLM for Geoscience Reasoning
Yule Xie | Jiaxin Ding | Cheng Deng | Shiqing Gao | Junran Zhang | Sibo Zhang | Zeyuan Wang | Ke Wu | Xin Ding | Luoyi Fu | Meng Jin | Xinbing Wang
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning (RL) has recently shown remarkable ability to enhance reasoning in large language models (LLMs), yet its potential in scientific domains beyond mathematics remains largely unexplored. Geoscience questions couple broad factual knowledge with multi-step inference and often rely on visual evidence such as maps, cross-sections, and diagrams, making them a challenging but verifiable testbed for RL-based reasoning. To enable this study, we introduce GeoMC-10K, a dataset of 10,000 geoscience multiple-choice questions spanning physical to human geography and high-school to professional levels; over 30% of the questions are image dependent. To support text-only RL on these multimodal questions, we design GeoM2T, a multi-agent framework that converts multimodal questions into descriptive text while preserving answerability and difficulty. Fine-tuning LLaMA-3.1-8B and Qwen-3-8B with Group Relative Policy Optimization (GRPO), incorporating a factual reward mechanism, yields GR1, which achieves absolute accuracy improvements of 5.9% and 13.3%, respectively, and it generalizes to out-of-distribution geoscience benchmarks. Together, GeoMC-10K, GeoM2T, and GR1 establish a scalable benchmark and baseline for RL-enhanced geoscience reasoning.
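The group-relative idea behind GRPO can be sketched as follows: several answers are sampled for one question, and each answer's reward is normalized against the group's mean and standard deviation. This is a minimal sketch, not the GR1 training code. The 0/1 correctness rewards, the use of the population standard deviation, and the `eps` term are assumptions, and the factual reward mechanism from the abstract is not modeled.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled answer's reward within its group.

    rewards: one scalar per answer sampled for the same question,
    e.g. 1.0 if the chosen option is correct, else 0.0 (assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    # eps avoids division by zero when every answer gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one multiple-choice question, two of them correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, while a group with identical rewards yields all-zero advantages and therefore no learning signal.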
2024
RepEval: Effective Text Evaluation with LLM Representation
Shuqian Sheng | Yi Xu | Tianhang Zhang | Zanwei Shen | Luoyi Fu | Jiaxin Ding | Lei Zhou | Xiaoying Gan | Xinbing Wang | Chenghu Zhou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The era of Large Language Models (LLMs) raises new demands for automatic evaluation metrics, which should be adaptable to various application scenarios while maintaining low cost and effectiveness. Traditional metrics for automatic text evaluation are often tailored to specific scenarios, while LLM-based evaluation metrics are costly, requiring fine-tuning or relying heavily on the generation capabilities of LLMs. Besides, previous LLM-based metrics ignore the fact that, within the space of LLM representations, there exist direction vectors that indicate the estimation of text quality. To this end, we introduce RepEval, a metric that leverages the projection of LLM representations for evaluation. Through simple prompt modifications, RepEval can easily transition to various tasks, requiring only minimal sample pairs for direction vector construction. Results on fourteen datasets across two evaluation tasks demonstrate the high effectiveness of our method, which exhibits a higher correlation with human judgments than previous methods, even in complex evaluation scenarios involving pair-wise selection under nuanced aspects. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
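The projection idea in this abstract can be sketched with plain vectors: build a quality direction from a few (better, worse) representation pairs, then score a new text by projecting its representation onto that direction. This is a hedged illustration only. Averaging pair differences is an assumed construction that may differ from the paper's, the function names are hypothetical, and the short lists stand in for real LLM hidden states.

```python
import math


def direction_from_pairs(pairs):
    """Build a unit direction from (better_rep, worse_rep) pairs.

    Assumption: the direction is the normalized sum of differences,
    which is one simple choice among several possible constructions.
    """
    dim = len(pairs[0][0])
    diff = [0.0] * dim
    for better, worse in pairs:
        for i in range(dim):
            diff[i] += better[i] - worse[i]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("pairs give no usable direction")
    return [d / norm for d in diff]


def projection_score(rep, direction):
    """Scalar projection of a representation onto the unit direction."""
    return sum(r * d for r, d in zip(rep, direction))


# Toy 2-d "representations" standing in for LLM hidden states.
pairs = [([1.0, 0.0], [0.0, 0.0]), ([2.0, 1.0], [0.0, 1.0])]
direction = direction_from_pairs(pairs)
score = projection_score([0.5, 9.0], direction)
```

Here the pairs differ only along the first axis, so the direction is that axis and the score ignores the large second component of the scored vector.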
Is Reference Necessary in the Evaluation of NLG Systems? When and Where?
Shuqian Sheng | Yi Xu | Luoyi Fu | Jiaxin Ding | Lei Zhou | Xinbing Wang | Chenghu Zhou
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The majority of automatic metrics for evaluating NLG systems are reference-based. However, the challenge of collecting human annotation results in a lack of reliable references in numerous application scenarios. Despite recent advancements in reference-free metrics, it has not been well understood when and where they can be used as an alternative to reference-based metrics. In this study, by employing diverse analytical approaches, we comprehensively assess the performance of both metrics across a wide range of NLG tasks, encompassing eight datasets and eight evaluation models. Based on solid experiments, the results show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality. However, their effectiveness varies across tasks and is influenced by the quality of candidate texts. Therefore, it’s important to assess the performance of reference-free metrics before applying them to a new task, especially when inputs are in uncommon form or when the answer space is highly variable. Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.