Guo Gan


2025

pdf bib
IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval
Tingyu Song | Guo Gan | Mingsheng Shang | Yilun Zhao
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We introduce IFIR, the first comprehensive benchmark designed to evaluate instruction-following information retrieval (IR) in expert domains. IFIR includes 2,426 high-quality examples and covers eight subsets across four specialized domains: finance, law, healthcare, and science literature. Each subset addresses one or more domain-specific retrieval tasks, replicating real-world scenarios where customized instructions are critical. IFIR enables a detailed analysis of instruction-following retrieval capabilities by incorporating instructions at different levels of complexity. We also propose a novel LLM-based evaluation method to provide a more precise and reliable assessment of model performance in following instructions. Through extensive experiments on 15 frontier retrieval models, including those based on LLMs, our results reveal that current models face significant challenges in effectively following complex, domain-specific instructions. We further provide in-depth analyses to highlight these limitations, offering valuable insights to guide future advancements in retriever development.

pdf bib
Are Multimodal LLMs Robust Against Adversarial Perturbations? RoMMath: A Systematic Evaluation on Multimodal Math Reasoning
Yilun Zhao | Guo Gan | Chen Zhao | Arman Cohan
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We introduce RoMMath, the first benchmark designed to evaluate the capabilities and robustness of multimodal large language models (MLLMs) in handling multimodal math reasoning, particularly when faced with adversarial perturbations. RoMMath consists of 4,800 expert-annotated examples, including an original set and seven adversarial sets, each targeting a specific type of perturbation at the text or vision levels. We evaluate a broad spectrum of 17 MLLMs on RoMMath and uncover a critical challenge regarding model robustness against adversarial perturbations. Through detailed error analysis by human experts, we gain a deeper understanding of the current limitations of MLLMs. Additionally, we explore various approaches to enhance the performance and robustness of MLLMs, providing insights that can guide future research efforts.