MMUIE: Massive Multi-Domain Universal Information Extraction for Long Documents

Shuyi Zhang, Zhenbin Chen, Shuting Li, Kewei Tu, Li Jing, Zixia Jia, Zilong Zheng


Abstract
We present **MMUIE**, a large-scale universal dataset for multi-domain, document-level information extraction (IE) from long texts. Existing IE systems predominantly operate at the sentence level or within narrow domains due to annotation constraints. MMUIE addresses this gap by introducing an automated annotation pipeline that integrates traditional knowledge bases with large language models to extract fine-grained entities, aliases, and relation triples across 34 domains. The dataset comprises a weakly supervised training set and a manually verified test set, featuring 723 entity types and 456 relation types. Empirical evaluations reveal that existing sentence-level IE models and even advanced LLMs underperform on this task, highlighting the need for better domain-aware document-level models. To this end, we develop DocUIE, a universal IE model fine-tuned on MMUIE, which achieves strong generalization and transferability across domains. MMUIE lays the foundation for robust, scalable, and universal information extraction from long-form text in diverse real-world scenarios. All code, data, and models are available at https://github.com/Shuyi-zsy/Massive-Multi-Domain-UIE.
Anthology ID:
2026.findings-eacl.334
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
6338–6370
Language:
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.334/
DOI:
Bibkey:
Cite (ACL):
Shuyi Zhang, Zhenbin Chen, Shuting Li, Kewei Tu, Li Jing, Zixia Jia, and Zilong Zheng. 2026. MMUIE: Massive Multi-Domain Universal Information Extraction for Long Documents. In Findings of the Association for Computational Linguistics: EACL 2026, pages 6338–6370, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
MMUIE: Massive Multi-Domain Universal Information Extraction for Long Documents (Zhang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.334.pdf
Checklist:
2026.findings-eacl.334.checklist.pdf