Jeongsoo Lee

2026

MHTS: Multi-Hop Tree Structure Framework for Generating Difficulty-Controllable QA Datasets for RAG Evaluation
Jeongsoo Lee | Daeyong Kwon | Kyohoon Jin | JunNyeong Jeong | Minwoo Sim | Minwoo Kim
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Existing RAG benchmarks often overlook query difficulty, leading to inflated performance on simpler questions and unreliable evaluations. A robust benchmark dataset must satisfy three key criteria: quality, ensuring complete and reliable ground truth (GT) responses; diversity, expanding semantic coverage to prevent overfitting; and difficulty, capturing the complexity of reasoning based on hops and the distribution of supporting evidence. However, current benchmarks lack a systematic approach to defining and controlling query difficulty at a fine-grained level. To address this, we propose MHTS (Multi-Hop Tree Structure), a novel dataset synthesis framework that systematically controls multi-hop reasoning complexity by leveraging a multi-hop tree structure to generate logically connected, multi-chunk queries. Our fine-grained difficulty estimation formula exhibits a strong correlation with the overall performance metrics of a RAG system, validating its effectiveness in assessing both retrieval and answer generation capabilities. By ensuring high-quality, diverse, and difficulty-controlled queries, our approach enhances RAG evaluation and benchmarking capabilities. This work contributes to the development of more reliable, efficient, and adaptable AI-driven research assistants, facilitating advancements in document-based reasoning and retrieval tasks.

2025

GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation
Jeongsoo Lee | Daeyong Kwon | Kyohoon Jin
Findings of the Association for Computational Linguistics: EMNLP 2025

Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks, but current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. These benchmarks overlook key factors such as the interaction between retrieval difficulty and reasoning depth. To address this gap, we propose GRADE, a novel evaluation framework that models task difficulty along two orthogonal dimensions: (1) reasoning depth, defined by the number of inference steps (hops), and (2) semantic distance between the query and its supporting evidence. We construct a synthetic multi-hop QA dataset from factual news articles by extracting knowledge graphs and augmenting them through semantic clustering to recover missing links, allowing us to generate diverse and difficulty-controlled queries. Central to our framework is a 2D difficulty matrix that combines generator-side and retriever-side difficulty. Experiments across multiple domains and models show that error rates strongly correlate with our difficulty measures, validating their diagnostic utility. GRADE enables fine-grained analysis of RAG performance and provides a scalable foundation for evaluating and improving multi-hop reasoning in real-world applications.

Venues

Findings (1)
LREC (1)
