Haoxin Zhang
2026
R³A: Reinforced Reasoning for Relevance Assessment for RAG in User-Generated Content Platforms
Xiaowei Yuan | Lei Jin | Haoxin Zhang | Ziyang Huang | Yan Gao | Yiwu | Yao Hu | Jun Zhao | Kang Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Retrieval-augmented generation (RAG) plays a critical role in user-generated content (UGC) platforms, but its effectiveness depends heavily on accurate query–document relevance assessment. Despite recent advances in applying large language models (LLMs) to relevance modeling, UGC platforms present unique challenges: 1) ambiguous user intent due to sparse user feedback in RAG scenarios, and 2) asymmetric relevance, where relevance is driven by localized answer-bearing content rather than global query–document similarity. To address these issues, we propose the Reinforced Reasoning model for Relevance Assessment (R³A), which decomposes relevance assessment into intent inference and evidence grounding. R³A leverages auxiliary high-clicked documents to infer latent query intent, and extracts verbatim evidence fragments to ground relevance decisions, reducing noise sensitivity and improving asymmetric relevance modeling. Experimental results demonstrate that R³A substantially outperforms strong baselines on offline benchmarks, while the distilled R³A-1.5B model achieves significant gains in large-scale online A/B testing, effectively balancing performance and practical deployability.
2025
RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios
Fei Zhao | Chengqiang Lu | Yufan Shen | Qimeng Wang | Yicheng Qian | Haoxin Zhang | Yan Gao | Yiwu | Yao Hu | Zhen Wu | Shangyu Xing | Xinyu Dai
Findings of the Association for Computational Linguistics: EMNLP 2025
Although various multimodal multi-image evaluation datasets have emerged, they are primarily based on English, and there has yet to be a Chinese multi-image dataset. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9393 samples and 69910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Finally, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios. Moreover, there remains a noticeable performance gap of around 71.8% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context. Our datasets will be publicly available.