Weijia Li
2026
RoZO: Geometry-Aware Zeroth-Order Fine-Tuning on Low-Rank Adapters for Black-Box Large Language Models
Zichen Song | Weijia Li
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have achieved remarkable success across a wide range of tasks, yet fine-tuning them efficiently under black-box or memory-constrained settings remains challenging. Parameter-efficient fine-tuning (PEFT) techniques such as LoRA alleviate memory usage by restricting updates to low-rank adapters, while zeroth-order (ZO) optimization further avoids back-propagation by estimating gradients from function evaluations. Recent work, such as LOZO, leverages random low-rank perturbations to reduce the variance of ZO estimates, but it overlooks the intrinsic geometric structure of LoRA adapters and suffers from unstable convergence and limited integration with adaptive optimizers. To address these limitations, we propose RoZO, a Riemannian zeroth-order optimization framework that constrains updates to the tangent space of the LoRA manifold. By exploiting geometry-aware updates with parallel transport, adaptive preconditioning, and trust-region control, RoZO achieves more stable convergence, tighter variance bounds, and superior performance compared to existing ZO methods.
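As a rough illustration of the zeroth-order idea mentioned in this abstract (estimating gradients from function evaluations only), here is a minimal sketch of a plain Euclidean two-point estimator. It is not RoZO: it has no low-rank adapters, tangent-space projection, parallel transport, preconditioning, or trust region. The names `zo_gradient`, `zo_sgd`, and `quadratic`, and the values of `eps`, `lr`, and `steps`, are assumptions made for this example.

```python
import random


def zo_gradient(f, x, eps=1e-3, rng=random):
    """Two-point zeroth-order gradient estimate of f at x (no back-propagation).

    Hypothetical helper: samples one Gaussian direction u and returns
    ((f(x + eps*u) - f(x - eps*u)) / (2*eps)) * u.
    """
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    scale = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [scale * ui for ui in u]


def zo_sgd(f, x, lr=0.05, steps=200, seed=0):
    """Plain gradient descent driven only by zeroth-order estimates."""
    rng = random.Random(seed)
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def quadratic(x):
    """Toy objective standing in for a black-box loss (assumption)."""
    return sum(xi * xi for xi in x)


start = [1.0, -2.0, 0.5]
end = zo_sgd(quadratic, start)
print(quadratic(start), quadratic(end))
```

Each step costs two function evaluations and no backward pass, which is the property that makes ZO methods attractive in black-box or memory-constrained settings.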
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Junbo Niu | Zheng Liu | Zhuangcheng Gu | Bin Wang | Linke Ouyang | Zhiyuan Zhao | Tao Chu | Tianyao He | Fan Wu | Qintong Zhang | Zhenjiang Jin | Guang Liang | Rui Zhang | Wenzheng Zhang | Yuan Qu | Zhifei Ren | Yuefeng Sun | Zirui Tang | Boyu Niu | Yuanhong Zheng | Dongsheng Ma | Ziyang Miao | Hejun Dong | Siyi Qian | Junyuan Zhang | Fangdong Wang | Jingzhou Chen | Xiaomeng Zhao | Liqun Wei | Wei Li | Shasha Wang | RuiLiang Xu | Yuanyuan Cao | Lu Chen | Qianqian Wu | Huaiyu Gu | Lindong Lu | Dechen Lin | Shenguanlin | Xuanhe Zhou | Linfeng Zhang | Yuhang Zang | Xiaoyi Dong | Jiaqi Wang | Bo Zhang | Lei Bai | Pei Chu | Weijia Li | Jiang Wu | Lijun Wu | Zhenxiang Li | Guangyu Wang | Zhongying Tu | Chao Xu | Kai Chen | Bowen Zhou | Dahua Lin | Wentao Zhang | Conghui He
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
2025
Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?
Zichen Wen | Yifeng Gao | Weijia Li | Conghui He | Linfeng Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal large language models (MLLMs) have shown remarkable performance in cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, many works have proposed to solve this problem with token pruning, which identifies redundant tokens in MLLMs and then prunes them to reduce computation and KV storage costs, leading to significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even compared to naive random token selection? Is attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? The neglect of these questions in previous research hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods. Code is available in the supplementary materials.
Stop Looking for “Important Tokens” in Multimodal Language Models: Duplication Matters More
Zichen Wen | Yifeng Gao | Shaobo Wang | Junyuan Zhang | Qintong Zhang | Weijia Li | Conghui He | Linfeng Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Vision tokens in multimodal large language models often account for a large share of the computational overhead because they far outnumber the tokens of the linguistic modality. Many recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually results in performance inferior to random token pruning and leads to incompatibility with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to 1.99× and 2.99× speed-ups in total time and in the prefilling stage, respectively, with good compatibility with efficient attention operators.
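The pivot-based selection described in this abstract can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: it assumes cosine similarity as the duplication measure, assumes the pivots are supplied by the caller and are themselves retained, and uses hypothetical names (`cosine`, `prune_by_duplication`, `keep`).

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_by_duplication(tokens, pivot_indices, keep):
    """Return sorted indices of retained tokens.

    Assumption: a token's duplication score is its highest cosine
    similarity to any pivot. The `keep` non-pivot tokens with the lowest
    scores are retained alongside the pivots; the rest are pruned.
    """
    pivot_set = set(pivot_indices)
    pivots = [tokens[i] for i in pivot_indices]
    scored = []
    for i, tok in enumerate(tokens):
        if i in pivot_set:
            continue
        duplication = max(cosine(tok, p) for p in pivots)
        scored.append((duplication, i))
    scored.sort()  # least duplicated first
    kept = {i for _, i in scored[:keep]}
    return sorted(pivot_set | kept)


# Toy 2-D "token embeddings": token 1 nearly duplicates pivot 0.
tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7]]
retained = prune_by_duplication(tokens, pivot_indices=[0], keep=1)
print(retained)
```

Because the score depends only on token features and not on attention maps, a selection rule of this kind does not need access to attention weights, which is consistent with the compatibility with efficient attention operators that the abstract reports.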
Co-authors
- Conghui He 3
- Linfeng Zhang 3
- Yifeng Gao 2
- Zichen Wen 2
- Qintong Zhang 2
- Lei Bai 1
- Yuanyuan Cao 1
- Jingzhou Chen 1
- Kai Chen 1
- Lu Chen 1
- Pei Chu 1
- Tao Chu 1
- Hejun Dong 1
- Xiaoyi Dong 1
- Huaiyu Gu 1
- Zhuangcheng Gu 1
- Tianyao He 1
- Zhenjiang Jin 1
- Wei Li 1
- Zhenxiang Li 1
- Guang Liang 1
- Dahua Lin 1
- Dechen Lin 1
- Zheng Liu 1
- Lindong Lu 1
- Dongsheng Ma 1
- Ziyang Miao 1
- Boyu Niu 1
- Junbo Niu 1
- Linke Ouyang 1
- Siyi Qian 1
- Yuan Qu 1
- Zhifei Ren 1
- Shenguanlin 1
- Zichen Song 1
- Yuefeng Sun 1
- Zirui Tang 1
- Zhongying Tu 1
- Bin Wang 1
- Fangdong Wang 1
- Guangyu Wang 1
- Jiaqi Wang 1
- Shaobo Wang 1
- Shasha Wang 1
- Liqun Wei 1
- Fan Wu 1
- Jiang Wu 1
- Lijun Wu 1
- Qianqian Wu 1
- Chao Xu 1
- RuiLiang Xu 1
- Yuhang Zang 1
- Bo Zhang 1
- Junyuan Zhang 1
- Junyuan Zhang 1
- Rui Zhang 1
- Wentao Zhang 1
- Wenzheng Zhang 1
- Xiaomeng Zhao 1
- Zhiyuan Zhao 1
- Yuanhong Zheng 1
- Bowen Zhou 1
- Xuanhe Zhou 1