Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge

Zhuoyi Yang, Yurun Song, Kyler G. Harris, Iftekhar Ahmed, Ian Harris


Abstract
Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has compared fine-tuning and retrieval-augmented generation (RAG) for factual recall and single-hop question answering, it remains unclear how these approaches perform in multi-hop settings that require compositional reasoning over temporally novel knowledge. In particular, prior comparisons often do not control for model scale, evaluation format, or knowledge freshness, making it difficult to isolate the effect of knowledge injection mechanisms.
In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: Question Answering Science Challenge (QASC), a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, which is designed to test knowledge beyond the models’ pretraining cutoff.
Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, RAG yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.
Anthology ID:
2026.gem-main.37
Volume:
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Simon Mille, Sebastian Gehrmann, Patrícia Schmidtová, Ondřej Dušek, Marzieh Fadaee, Kyle Lo, Enrico Santus, Gabriel Stanovsky
Venues:
GEM | WS
Publisher:
Association for Computational Linguistics
Pages:
384–392
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.37/
Cite (ACL):
Zhuoyi Yang, Yurun Song, Kyler G. Harris, Iftekhar Ahmed, and Ian Harris. 2026. Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge. In Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM), pages 384–392, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge (Yang et al., GEM 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.gem-main.37.pdf