Wenlong Zhang
2026
FlowSearch: Advancing Deep Research with Dynamic Structured Knowledge Flow
Yusong Hu | Runmin Ma | Yue Fan | Jinxin Shi | Zongsheng Cao | Yuhao Zhou | Jiakang Yuan | Shuaiyu Zhang | Shiyang Feng | Xiangchao Yan | Shufei Zhang | Wenlong Zhang | Lei Bai | Bo Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Deep research is an inherently challenging task that demands both breadth and depth of thinking. It involves navigating diverse knowledge spaces and reasoning over complex, multi-step dependencies, which presents substantial challenges for agentic systems. To address this, we propose FlowSearch, a multi-agent framework that actively constructs and evolves a dynamic structured knowledge flow to drive subtask execution and reasoning. FlowSearch is capable of strategically planning and expanding the knowledge flow to enable parallel exploration and hierarchical task decomposition, while also adjusting the knowledge flow in real time based on feedback from intermediate reasoning outcomes and insights. FlowSearch achieves competitive performance on both general and scientific benchmarks, including GAIA, HLE, GPQA and TRQA, demonstrating its effectiveness in multi-disciplinary research scenarios and its potential to advance scientific discovery. The code will be available.
MSEarth: A Multimodal Benchmark for Earth Science Phenomenon Discovery with MLLMs
Xiangyu Zhao | Wanghan Xu | Bo Liu | Yuhao Zhou | Fenghua Ling | Ben Fei | Xiaoyu Yue | Lei Bai | Wenlong Zhang | Xiao-Ming Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid advancement of multimodal large language models (MLLMs) offers new opportunities for complex scientific challenges, yet their application in earth science—especially at the graduate level—remains underexplored due to a lack of benchmarks reflecting the depth and complexity of geoscientific reasoning. Existing datasets often rely on synthetic data or simple figure-caption pairs, failing to capture the nuanced reasoning required for real-world applications. To address this, we introduce MSEarth, a multimodal scientific dataset and benchmark curated from high-quality, open-access publications. Covering the five major spheres of Earth science—atmosphere, cryosphere, hydrosphere, lithosphere, and biosphere—MSEarth features over 289K figures with refined captions enriched by contextual discussions and reasoning from the original papers. The benchmark supports tasks such as scientific figure captioning, multiple choice questions, and open-ended reasoning, providing a scalable, high-fidelity resource for developing and evaluating MLLMs in scientific reasoning.
2024
UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation
Xiangyu Zhao | Yuehan Zhang | Wenlong Zhang | Xiao-Ming Wu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The fashion domain encompasses a variety of real-world multimodal tasks, including multimodal retrieval and multimodal generation. Rapid advances in AI-generated content, particularly large language models for text generation and diffusion models for visual generation, have sparked widespread research interest in applying these multimodal models to the fashion domain. However, tasks that use embeddings, such as image-to-text or text-to-image retrieval, have been largely ignored from this perspective due to the diverse nature of the multimodal fashion domain. Moreover, current research on multi-task single models lacks a focus on image generation. In this work, we present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain, integrating image generation with retrieval tasks and text generation tasks. UniFashion unifies embedding and generative tasks by integrating a diffusion model and an LLM, enabling controllable and high-fidelity generation. Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks, and can be readily adapted to manage complex vision-language tasks. This work demonstrates the potential learning synergy between multimodal generation and retrieval, offering a promising direction for future research in the fashion domain.