SEAL: Structure and Element Aware Learning Improves Long Structured Document Retrieval

Xinhao Huang, Zhibo Ren, Yipeng Yu, Ying Zhou, Zulong Chen, Zeyi Wen


Abstract
In long structured document retrieval, existing methods typically fine-tune pre-trained language models (PLMs) using contrastive learning on datasets lacking explicit structural information. This practice suffers from two critical issues: 1) current methods fail to leverage structural features and element-level semantics effectively, and 2) datasets containing structural metadata are lacking. To bridge these gaps, we propose SEAL, a novel contrastive learning framework. It leverages structure-aware learning to preserve semantic hierarchies and masked element alignment for fine-grained semantic discrimination. Furthermore, we release StructDocRetrieval, a long structured document retrieval dataset with rich structural annotations. Extensive experiments on both the released and industrial datasets across various modern PLMs, together with online A/B testing, demonstrate consistent improvements, boosting NDCG@10 from 73.96% to 77.84% on BGE-M3. The resources are available at https://github.com/xinhaoH/SEAL.
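For readers unfamiliar with the NDCG@10 metric quoted in the abstract, the following is a minimal sketch of how it is commonly computed for a single query. It assumes the linear-gain form of DCG with a log2 rank discount; the function names dcg_at_k and ndcg_at_k are illustrative and are not taken from the SEAL repository, and the paper's own evaluation script may differ in details such as the gain function.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain, assumed)."""
    # Rank positions are 1-based, so the discount for index i is log2(i + 2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """NDCG@k for one query.

    ranked_relevances: relevance labels in the order the system returned them.
    all_relevances: labels of every judged document for the query; if omitted,
        the ideal ranking is built from the returned list only (an assumption).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents: define the score as 0
    return dcg_at_k(ranked_relevances, k) / ideal
```

A corpus-level score such as the one reported in the abstract is typically the mean of this per-query value over all test queries.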
Anthology ID:
2025.emnlp-main.429
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8537–8547
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.429/
Cite (ACL):
Xinhao Huang, Zhibo Ren, Yipeng Yu, Ying Zhou, Zulong Chen, and Zeyi Wen. 2025. SEAL: Structure and Element Aware Learning Improves Long Structured Document Retrieval. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8537–8547, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
SEAL: Structure and Element Aware Learning Improves Long Structured Document Retrieval (Huang et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.429.pdf
Checklist:
 2025.emnlp-main.429.checklist.pdf