Zhibo Ren
2025
SEAL: Structure and Element Aware Learning Improves Long Structured Document Retrieval
Xinhao Huang
|
Zhibo Ren
|
Yipeng Yu
|
Ying Zhou
|
Zulong Chen
|
Zeyi Wen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) the lack of datasets containing structural metadata. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release StructDocRetrieval, a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both the released and industrial datasets across various modern PLMs, and online A/B testing demonstrate consistent improvements, boosting NDCG@10 from 73.96% to 77.84% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.