A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows

Han Yuxuan, Yuanxing Zhang, Yushuo Wang, Yichao Jin


Abstract
Structured information extraction from long, multilingual scanned financial documents is a core requirement in industrial KYC and compliance workflows. These documents are typically non-machine-readable, noisy, and visually heterogeneous. They usually span dozens of pages while containing only sparse task-relevant information. Although recent vision–language models (VLMs) achieve strong benchmark performance, directly applying them end-to-end to full financial reports often leads to unreliable extraction under real-world conditions. We present a multistage extraction framework that integrates image preprocessing, multilingual OCR, hybrid page-level retrieval, and compact VLM-based structured extraction. The design separates page localization from multimodal reasoning, enabling more accurate extraction from complex multi-page documents. We evaluated the framework on 120 production KYC documents comprising about 3,000 multilingual scanned pages. Across multiple OCR–VLM combinations, the proposed pipeline consistently outperforms direct PDF-to-VLM baselines, improving field-level accuracy by up to 31.9 percentage points. The best configuration, PaddleOCR with MiniCPM-o-2.6, achieves 87.27% accuracy. Ablation studies show that page-level retrieval is the dominant factor in performance improvements, particularly for complex financial statements and non-English documents.
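The abstract does not spell out how the hybrid page-level retrieval stage scores pages, so the following is only a minimal sketch of the general idea: rank OCR'd pages against a field query and pass the top few to the VLM. It assumes "hybrid" means a weighted mix of a lexical score and a similarity score. The function names, the `alpha` weight, and the character n-gram cosine (a stand-in for whatever dense retriever the authors actually use) are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercased word tokens (hypothetical helper, not from the paper)."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(text, n=3):
    """Character n-gram counts; a crude stand-in for a dense embedding."""
    s = re.sub(r"\s+", " ", text.lower())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def keyword_score(query, page_text):
    """Fraction of distinct query tokens that appear on the page."""
    q = set(tokens(query))
    if not q:
        return 0.0
    return len(q & set(tokens(page_text))) / len(q)


def rank_pages(query, pages, alpha=0.5, top_k=2):
    """Return indices of the top_k pages by the assumed hybrid score.

    alpha weights the lexical score; (1 - alpha) weights the n-gram cosine.
    Ties are broken by page order.
    """
    q_grams = char_ngrams(query)
    scored = []
    for idx, text in enumerate(pages):
        score = (alpha * keyword_score(query, text)
                 + (1 - alpha) * cosine(q_grams, char_ngrams(text)))
        scored.append((score, idx))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [idx for _, idx in scored[:top_k]]
```

In a pipeline of the kind described, `pages` would hold the OCR text of each scanned page and only the returned page indices would be forwarded to the VLM, which is one way to keep page localization separate from multimodal reasoning. Character n-grams are used here because whitespace tokenization alone is unreliable for some non-English scripts.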
Anthology ID:
2026.acl-industry.99
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Yunyao Li, Georg Rehm, Mei Tu
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1419–1433
URL:
https://preview.aclanthology.org/ingestion-form-platform/2026.acl-industry.99/
Cite (ACL):
Han Yuxuan, Yuanxing Zhang, Yushuo Wang, and Yichao Jin. 2026. A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 1419–1433, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
A Multistage Extraction Pipeline for Long Scanned Financial Documents: An Empirical Study in Industrial KYC Workflows (Yuxuan et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingestion-form-platform/2026.acl-industry.99.pdf