StruNRAG: Evaluation of OCR-Induced Structural Noise on RAG Robustness
Mengna Gao, Dapeng Yin, Shuyue Zhu, Bingxuan Hou, Zhanpeng Ni, Junli Wang
Abstract
Retrieval-Augmented Generation (RAG) systems rely on Optical Character Recognition (OCR) to ingest knowledge from unstructured documents. However, OCR engines often struggle with complex layouts, introducing Structural Noise, such as line insertion and paragraph interleaving, which disrupts the semantic flow of the text. Existing evaluations largely overlook this dimension, operating on the assumption of structurally perfect input. To bridge this gap, we introduce StruNRAG, a dedicated benchmark for evaluating RAG robustness against OCR-induced structural perturbations. We construct a bilingual dataset of 2,132 question-answer pairs derived from complex Chinese and English documents and systematically inject three categories of real-world structural noise: line insertion, paragraph interleaving, and line interleaving. Our evaluation of mainstream retrievers and Large Language Models (LLMs) reveals a nuanced interaction between noise and pipeline stages: while structural distortions consistently degrade retrieval performance, the generation stage exhibits unexpected robustness. Advanced LLMs demonstrate robustness against local noise (e.g., line insertion), but struggle to maintain reasoning capabilities under severe structural disruption that fragments global context. These findings indicate that while LLMs are capable of compensating for minor parsing errors, future RAG optimizations must take into account the effects of structural noise. Our code and datasets are available at [https://github.com/GaoMengnana/StruNRAG](https://github.com/GaoMengnana/StruNRAG).

- Anthology ID:
- 2026.findings-acl.955
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 19129–19148
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.955/
- Cite (ACL):
- Mengna Gao, Dapeng Yin, Shuyue Zhu, Bingxuan Hou, Zhanpeng Ni, and Junli Wang. 2026. StruNRAG: Evaluation of OCR-Induced Structural Noise on RAG Robustness. In Findings of the Association for Computational Linguistics: ACL 2026, pages 19129–19148, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- StruNRAG: Evaluation of OCR-Induced Structural Noise on RAG Robustness (Gao et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.955.pdf
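
The abstract names three categories of structural noise (line insertion, paragraph interleaving, and line interleaving) without defining them here. The following is a minimal sketch of one plausible reading of each, not the authors' implementation: the function names, the fixed insertion interval, and the strict alternation between two text blocks are all assumptions made for illustration.

```python
from itertools import zip_longest


def interleave(a, b):
    """Alternate items from two sequences, keeping leftovers from the longer one."""
    out = []
    for x, y in zip_longest(a, b):
        if x is not None:
            out.append(x)
        if y is not None:
            out.append(y)
    return out


def insert_lines(lines, stray, every):
    """Line insertion (assumed form): add a stray line, such as a page
    header picked up by OCR, after every `every` original lines."""
    out = []
    for i, line in enumerate(lines, start=1):
        out.append(line)
        if i % every == 0:
            out.append(stray)
    return out


def interleave_lines(block_a, block_b):
    """Line interleaving (assumed form): lines of two side-by-side text
    blocks are read alternately, as if columns were merged row by row."""
    return "\n".join(interleave(block_a.splitlines(), block_b.splitlines()))


def interleave_paragraphs(doc_a, doc_b):
    """Paragraph interleaving (assumed form): whole paragraphs, separated
    by blank lines, from two text blocks are emitted alternately."""
    return "\n\n".join(interleave(doc_a.split("\n\n"), doc_b.split("\n\n")))
```

Under these assumptions, line insertion leaves every original line in order and only adds local clutter, whereas the two interleaving functions break the adjacency of related text. That difference matches the abstract's distinction between local noise and disruption that fragments global context.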