Di Zhang
2025
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation
Xinlong Chen | Yuanxing Zhang | Chongling Rao | Yushuo Guan | Jiaheng Liu | Fuzheng Zhang | Chengru Song | Qiang Liu | Di Zhang | Tieniu Tan
Findings of the Association for Computational Linguistics: ACL 2025
The training of controllable text-to-video (T2V) models relies heavily on the alignment between videos and captions, yet little existing research connects video caption evaluation with T2V generation assessment. This paper introduces VidCapBench, a video caption evaluation scheme specifically designed for T2V generation, agnostic to any particular caption format. VidCapBench employs a data annotation pipeline, combining expert model labeling and human refinement, to associate each collected video with key information spanning video aesthetics, content, motion, and physical laws. VidCapBench then partitions these key information attributes into automatically assessable and manually assessable subsets, catering to both the rapid evaluation needs of agile development and the accuracy requirements of thorough validation. By evaluating numerous state-of-the-art captioning models, we demonstrate the superior stability and comprehensiveness of VidCapBench compared to existing video captioning evaluation approaches. Verification with off-the-shelf T2V models reveals a significant positive correlation between scores on VidCapBench and the T2V quality evaluation metrics, indicating that VidCapBench can provide valuable guidance for training T2V models. The project is available at https://github.com/VidCapBench/VidCapBench.
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs
Minxuan Lv | Zhenpeng Su | Leiyu Pan | Yizhe Xiong | Zijia Lin | Hui Chen | Wei Zhou | Jungong Han | Guiguang Ding | Wenwu Ou | Di Zhang | Kun Gai | Songlin Hu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
As large language models continue to scale, computational costs and resource consumption have emerged as significant challenges. While existing sparsification methods like pruning reduce computational overhead, they risk losing model knowledge through parameter removal. This paper proposes DSMoE (Dynamic Sparse Mixture-of-Experts), a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks. We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge based on input complexity. Additionally, we introduce a sparsity loss term to balance performance and computational efficiency. Extensive experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches across language modeling and downstream tasks, particularly excelling in generation tasks. Analysis reveals that DSMoE learns distinctive layerwise activation patterns, providing new insights for future MoE architecture design.
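The partition-and-route idea in this abstract can be illustrated with a small forward-pass sketch. It is a sketch under stated assumptions, not the authors' implementation: it uses a plain two-matrix ReLU FFN rather than LLaMA's gated FFN, splits hidden units into contiguous equal groups, and uses a fixed 0.5 threshold on the sigmoid gate. The function names and the `router` matrix are hypothetical. The straight-through estimator only affects the backward pass (gradients would flow through the sigmoid as if the hard mask were not there), so it appears here only as a comment.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_ffn(w1, w2, x):
    """y = W2 . relu(W1 . x); w1 is hidden x d_in, w2 is d_out x hidden."""
    return matvec(w2, relu(matvec(w1, x)))

def partition(hidden, num_experts):
    """Split hidden-unit indices into contiguous, equally sized groups (assumed scheme)."""
    assert hidden % num_experts == 0, "hidden size must divide evenly"
    size = hidden // num_experts
    return [list(range(e * size, (e + 1) * size)) for e in range(num_experts)]

def expert_out(w1, w2, x, idx):
    """Contribution of one block: only the rows of W1 and columns of W2 in idx."""
    h = relu(matvec([w1[i] for i in idx], x))
    return [sum(row[i] * hj for i, hj in zip(idx, h)) for row in w2]

def sparse_ffn(w1, w2, router, x, num_experts, threshold=0.5):
    """Sum only the blocks whose sigmoid gate exceeds the threshold.

    Each gate is independent (sigmoid, not softmax), so the number of active
    blocks can vary per token. In training, a straight-through estimator would
    pass gradients through the sigmoid despite the hard mask; not shown here.
    """
    groups = partition(len(w1), num_experts)
    gates = [sigmoid(s) for s in matvec(router, x)]
    mask = [1.0 if g > threshold else 0.0 for g in gates]
    y = [0.0] * len(w2)
    for m, idx in zip(mask, groups):
        if m:
            y = [a + b for a, b in zip(y, expert_out(w1, w2, x, idx))]
    return y, mask
```

When every gate is open, the block outputs sum to the dense FFN output, which is why partitioning a pre-trained layer this way removes no parameters; closing gates skips the corresponding blocks' computation.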
iMOVE: Instance-Motion-Aware Video Understanding
Jiaze Li | Yaya Shi | Zongyang Ma | Haoran Xu | Yandong.bai Yandong.bai | Huihui Xiao | Ruiwen Kang | Fan Yang | Tingting Gao | Di Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Enhancing the fine-grained instance spatiotemporal motion perception capabilities of Video Large Language Models is crucial for improving their temporal and general video understanding. However, current models struggle to perceive detailed and complex instance motions. To address these challenges, we have made improvements from both data and model perspectives. In terms of data, we have meticulously curated iMOVE-IT, the first large-scale instance-motion-aware video instruction-tuning dataset. This dataset is enriched with comprehensive instance motion annotations and spatiotemporal mutual-supervision tasks, providing extensive training for the model’s instance-motion-awareness. Building on this foundation, we introduce iMOVE, an instance-motion-aware video foundation model that utilizes Event-aware Spatiotemporal Efficient Modeling to retain informative instance spatiotemporal motion details while maintaining computational efficiency. It also incorporates Relative Spatiotemporal Position Tokens to ensure awareness of instance spatiotemporal positions. Evaluations indicate that iMOVE excels not only in video temporal understanding and general video understanding but also demonstrates significant advantages in long-term video understanding. We will release the data, code, and model weights after acceptance.
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
Xiao Wang | Jingyun Hua | Weihong Lin | Yuanxing Zhang | Fuzheng Zhang | Jianlong Wu | Di Zhang | Liqiang Nie
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. **HAICTrain** comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, **HAICBench** includes 412 manually annotated video-caption pairs and 2,000 QA pairs, for a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human understanding abilities across 4 benchmarks, but can also improve text-to-video generation results. Both the HAICTrain and HAICBench will be made open-source to facilitate further research.
SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin
Hao Yi | Qingyang Li | Yulan Hu | Fuzheng Zhang | Di Zhang | Yong Liu
Findings of the Association for Computational Linguistics: EMNLP 2025
Enhancing the numerical and logical reasoning capabilities of Large Language Models (LLMs) has become a prominent research focus. Existing approaches exhibit notable limitations: inference-phase techniques, such as Chain of Thought, depend on prompt engineering and pretrained knowledge; sentence-level Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) struggle to ensure step-wise mathematical correctness and often rely on model distillation or human annotations; Reinforcement Learning (RL) methods entail high GPU memory consumption and training instability. To overcome these challenges, we propose Self-training with Process Preference learning using Dynamic value margin (SPPD). SPPD formulates reasoning as a process-based Markov Decision Process (MDP), leveraging the Bellman optimality equation to derive a dynamic value margin for step-level preference optimization. It further incorporates tree-based self-sampling of model responses, eliminating the need for distillation. We theoretically establish that SPPD is equivalent to on-policy policy gradient methods under constrained reward functions. Experimental results on 7B-scale models show consistent superiority across both in-domain and out-of-domain mathematical benchmarks.
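This abstract does not state the SPPD objective, so the following is only a rough illustration of what a step-level preference loss with a margin can look like. It is a generic DPO-style logistic loss with an additive margin term, not the paper's derivation: the margin is passed in as a plain number, and how SPPD obtains it from the Bellman optimality equation is not shown in this listing. The function name, argument names, and the sign convention of the margin are assumptions.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def step_preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, margin, beta=0.1):
    """Generic margin-based preference loss for one pair of reasoning steps.

    logp_w / logp_l: policy log-probabilities of the preferred / rejected step.
    ref_logp_w / ref_logp_l: the same quantities under a frozen reference model.
    margin: required gap between the two implicit rewards (hypothetical input;
            a dynamic scheme would compute it per step pair).
    """
    reward_gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(reward_gap - margin)
```

With a zero margin this reduces to the standard DPO loss for a single pair; a larger margin demands a larger implicit-reward gap before the loss becomes small.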
Co-authors
- Fuzheng Zhang 3
- Yuanxing Zhang 2
- Xinlong Chen 1
- Hui Chen 1
- Guiguang Ding 1
- Kun Gai 1
- Tingting Gao 1
- Yushuo Guan 1
- Jungong Han 1
- Songlin Hu 1
- Yulan Hu 1
- Jingyun Hua 1
- Ruiwen Kang 1
- Jiaze Li 1
- Qingyang Li 1
- Zijia Lin 1
- Weihong Lin 1
- Jiaheng Liu 1
- Qiang Liu 1
- Yong Liu 1
- Minxuan Lv 1
- Zongyang Ma 1
- Liqiang Nie 1
- Wenwu Ou 1
- Leiyu Pan 1
- Chongling Rao 1
- Yaya Shi 1
- Chengru Song 1
- Zhenpeng Su 1
- Tieniu Tan 1
- Xiao Wang 1
- Jianlong Wu 1
- Huihui Xiao 1
- Yizhe Xiong 1
- Haoran Xu 1
- Yandong.bai Yandong.bai 1
- Fan Yang 1
- Hao Yi 1
- Wei Zhou 1