Ziyan Liu
2026
From Selection to Refinement: Iterative Optimization for Instruction Data
Hang Hu | Ziyan Liu | Rujie Wen | Ruihui Hou | Xueyan Wu | Mu Zhang | Jianxing Yu | Tong Ruan | Jingping Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Instruction tuning plays a crucial role in enhancing large language models (LLMs) to better understand complex user instructions. While various data selection and revision methods have been explored to optimize instruction tuning datasets, they face two main challenges: unreasonable pruning of potentially valuable low-quality data and the persistence of noise or semantic drift during revision. To address these issues, we propose a novel automated iterative framework for instruction data optimization. Our framework introduces Instruction Quality Differentiation to identify valuable high-quality and low-quality data across multiple dimensions. For low-quality data, we propose a Feedback-driven Iterative Refinement mechanism with an "evaluate-refine-review" process and design an Output Alignment module to improve data quality. Experiments on seven public benchmark datasets show that our framework outperforms state-of-the-art methods, achieving 2.09% and 2.60% improvements on the Alpaca and Dolly datasets, respectively, with high data efficiency. Our code and data are available at https://github.com/surihuhang/From-Selection-to-Refinement–Iterative-Optimization-for-Instruction-Data.
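The "evaluate-refine-review" process named in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the threshold, the round limit, and the toy scoring rule are all assumptions made for illustration, and in the paper's setting these roles would presumably be played by LLM calls.

```python
def iterative_refine(sample, evaluate, refine, review, threshold=0.8, max_rounds=3):
    """Hypothetical evaluate-refine-review loop (names and defaults are assumptions)."""
    current = sample
    for _ in range(max_rounds):
        score, feedback = evaluate(current)      # evaluate: score the sample, collect feedback
        if score >= threshold:
            break                                # good enough, stop refining
        candidate = refine(current, feedback)    # refine: rewrite using the feedback
        if review(current, candidate):           # review: accept only if the rewrite is safe
            current = candidate
    return current


# Toy stand-ins so the loop can run without a model.
def toy_evaluate(sample):
    words = len(sample["output"].split())
    score = min(words / 5, 1.0)
    feedback = "" if score >= 1.0 else "output is too short"
    return score, feedback


def toy_refine(sample, feedback):
    return {"instruction": sample["instruction"],
            "output": sample["output"] + " (with more detail)"}


def toy_review(old, new):
    # Reject rewrites that change the instruction (a crude guard against drift).
    return new["instruction"] == old["instruction"] and len(new["output"]) > len(old["output"])


refined = iterative_refine({"instruction": "Name a prime.", "output": "Two"},
                           toy_evaluate, toy_refine, toy_review)
```

The review step is what separates this loop from plain repeated rewriting: a candidate that fails the check is discarded and the previous version is kept.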
Tiny Scales, Great Challenges: The Limits of Multimodal LLMs in Scale Recognition
Jihang Jin | Ronghao Chen | Hao Zhang | Ziyan Liu | Huacan Wang | Qi Ye | Jingping Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Visual scale recognition is fundamental to how humans perceive physical quantities in the real world, and it is crucial for enabling human-like intelligence in multimodal large language models (MLLMs). However, existing benchmarks typically focus on a single type of quantity (e.g., time) or a specific format (e.g., dials), lacking a comprehensive evaluation of scale recognition capabilities. To address these problems, we propose ScaleBench, a visual scale recognition benchmark built using images from COCO, Open Images, and Flickr, designed to comprehensively evaluate the scale recognition capabilities of MLLMs. To ensure high data quality, we develop detailed annotation guidelines and procedures, resulting in a total of 6,574 annotated samples. Based on this benchmark, we evaluate multiple closed-source and open-source MLLMs. Experimental results reveal that the best-performing model achieves only 42.60% accuracy, far lower than the 97.40% of humans. Furthermore, we conduct in-depth experimental analyses and provide future research directions. Our benchmark and implementation code are available at https://github.com/Sonder-hang/ScaleBench.
MirrorQA: Benchmarking Multimodal LLMs on Mirror-Orientation Reasoning
Jingping Liu | Xingchen Peng | Yan Zhou | Ziyan Liu | Jie Zhai | Ronghao Chen | Huacan Wang | Xiaofeng Jia
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal large language models (MLLMs) have achieved remarkable progress in recent years, yet their ability to perform left–right reasoning in mirror contexts, a fundamental element of spatial cognition, remains underexplored. To address this gap, we introduce MirrorQA, a manually constructed benchmark with 5,549 samples, designed to evaluate MLLMs’ capability to distinguish left from right from a subject-centered perspective. MirrorQA is built through a three-stage pipeline (annotation, verification, and final review) to ensure high-quality labeling. Comprehensive evaluations on both open- and closed-source MLLMs show that even the best-performing models achieve only 65.40% accuracy, far below the 99.28% accuracy of humans. These results highlight substantial challenges in current MLLMs when reasoning about left and right, and point to promising directions for future research. MirrorQA and its code are publicly available at https://github.com/stargazer-zeno/MirrorQA.
2025
RedOne: Revealing Domain-specific LLM Post-Training in Social Networking Services
Fei Zhao | Chonggang Lu | Wangyue | Zheyong Xie | Ziyan Liu | Haofu Qian | Jianzhao Huang | Fangcheng Shi | Zijie Meng | Hongcheng Guo | Mingqian He | Xinze Lyu | Zheyu Ye | Weiting Liu | Boyang Wang | Shaosheng Cao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
As a primary medium for modern information dissemination, social networking services (SNS) have experienced rapid growth, which has posed significant challenges for platform content management and interaction quality improvement. Recently, the development of large language models (LLMs) has offered potential solutions, but existing studies focus on isolated tasks, which not only encounter diminishing benefits from data scaling within individual scenarios but also fail to adapt flexibly to diverse real-world contexts. To address these challenges, we introduce RedOne, a domain-specific LLM designed to break the performance bottleneck of single-task baselines and establish a comprehensive foundation for SNS. RedOne was developed through a three-stage training strategy consisting of continued pretraining, supervised fine-tuning, and preference optimization, using a large-scale real-world dataset. Through extensive experiments, RedOne maintains strong general capabilities and achieves an average improvement of up to 14.02% across 8 major SNS tasks and 7.56% on the SNS bilingual evaluation benchmark, compared with base models. Furthermore, through online testing, RedOne reduced the exposure rate in harmful content detection by 11.23% and improved the click page rate in post-view search by 14.95% compared with single-task baseline models. These results establish RedOne as a robust domain-specific LLM for SNS, demonstrating excellent generalization across various tasks and promising applicability in real-world scenarios.
MIND: A Multi-agent Framework for Zero-shot Harmful Meme Detection
Ziyan Liu | Chunxiao Fan | Haoran Lou | Yuexin Wu | Kaiwei Deng
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid expansion of memes on social media has highlighted the urgent need for effective approaches to detect harmful content. However, traditional data-driven approaches struggle to detect new memes due to their evolving nature and the lack of up-to-date annotated data. To address this issue, we propose MIND, a multi-agent framework for zero-shot harmful meme detection that does not rely on annotated data. MIND implements three key strategies: 1) We retrieve similar memes from an unannotated reference set to provide contextual information. 2) We propose a bi-directional insight derivation mechanism to extract a comprehensive understanding of similar memes. 3) We then employ a multi-agent debate mechanism to ensure robust decision-making through reasoned arbitration. Extensive experiments on three meme datasets demonstrate that our proposed framework not only outperforms existing zero-shot approaches but also shows strong generalization across different model architectures and parameter scales, providing a scalable solution for harmful meme detection.
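Two of the three strategies listed in the abstract, retrieval from an unannotated reference set and debate with arbitration, can be sketched as follows. This is an illustration under stated assumptions, not the MIND implementation: cosine similarity over precomputed embeddings is assumed for retrieval, the function names and the two-verdict debate are hypothetical, and the bi-directional insight derivation step is omitted because the abstract does not describe it in enough detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_similar(query_embedding, reference_set, k=2):
    """Return the k reference memes most similar to the query; no labels are needed."""
    ranked = sorted(reference_set,
                    key=lambda item: cosine(query_embedding, item["embedding"]),
                    reverse=True)
    return ranked[:k]


def debate(verdict_a, verdict_b, arbitrate):
    """If two agents agree, keep their verdict; otherwise defer to an arbiter (assumed design)."""
    if verdict_a == verdict_b:
        return verdict_a
    return arbitrate(verdict_a, verdict_b)


# Toy data: embeddings and verdicts are made up for the example.
reference_set = [
    {"id": "m1", "embedding": [1.0, 0.0]},
    {"id": "m2", "embedding": [0.0, 1.0]},
    {"id": "m3", "embedding": [0.9, 0.1]},
]
neighbours = retrieve_similar([1.0, 0.0], reference_set, k=2)
neighbour_ids = [item["id"] for item in neighbours]
final = debate("harmful", "harmless", lambda a, b: "harmful")
```

In a real system the verdicts and the arbiter would come from model calls conditioned on the retrieved memes; plain strings and a lambda stand in for them here.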
Can Multimodal Large Language Models Understand Spatial Relations?
Jingping Liu | Ziyan Liu | Zhedong Cen | Yan Zhou | Yinan Zou | Weiyan Zhang | Haiyun Jiang | Tong Ruan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Spatial relation reasoning is a crucial task for multimodal large language models (MLLMs) to understand the objective world. However, current benchmarks have issues such as relying on bounding boxes, ignoring perspective substitutions, or allowing questions to be answered using only the model’s prior knowledge without image understanding. To address these issues, we introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO2017, which enables MLLMs to focus more on understanding images in the objective world. To ensure data quality, we design a well-tailored annotation procedure, resulting in SpatialMQA consisting of 5,392 samples. Based on this benchmark, a series of closed- and open-source MLLMs are implemented, and the results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%. Extensive experimental analyses are also conducted, suggesting future research directions. The benchmark and code are available at https://huggingface.co/datasets/liuziyan/SpatialMQA.
Co-authors
- Jingping Liu 4
- Ronghao Chen 2
- Tong Ruan 2
- Huacan Wang 2
- Yan Zhou 2
- Shaosheng Cao 1
- Zhedong Cen 1
- Kaiwei Deng 1
- Chunxiao Fan 1
- Hongcheng Guo 1
- Mingqian He 1
- Ruihui Hou 1
- Hang Hu 1
- Jianzhao Huang 1
- Xiaofeng Jia 1
- Haiyun Jiang 1
- Jihang Jin 1
- Weiting Liu 1
- Haoran Lou 1
- Chonggang Lu 1
- Xinze Lyu 1
- Zijie Meng 1
- Xingchen Peng 1
- Haofu Qian 1
- Fangcheng Shi 1
- Boyang Wang 1
- Wangyue 1
- Rujie Wen 1
- Xueyan Wu 1
- Yuexin Wu 1
- Zheyong Xie 1
- Qi Ye 1
- Zheyu Ye 1
- Jianxing Yu 1
- Jie Zhai 1
- Hao Zhang 1
- Mu Zhang 1
- Weiyan Zhang 1
- Fei Zhao 1
- Yinan Zou 1