DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations

Nicholas Popovic, Ashish Kangen, Tim Schopf, Michael Färber


Abstract
Large, high-quality annotated corpora remain scarce for document-level entity and relation extraction in zero-shot or few-shot settings. In this paper, we present a fully automatic, LLM-based pipeline for synthetic data generation and in-context learning for document-level entity and relation extraction. In contrast to existing approaches that rely on manually annotated demonstrations or direct zero-shot inference, our method combines synthetic data generation with retrieval-based in-context learning, using a reasoning-optimized language model. This allows us to build a high-quality demonstration database without manual annotation and to dynamically retrieve relevant examples at inference time. Based on our approach, we produce a synthetic dataset of over 5k Wikipedia abstracts with approximately 59k entities and 30k relation triples. Finally, we evaluate in-context learning performance on the DocIE shared task, extracting entities and relations from long documents in a zero-shot setting. The code and synthetic dataset are made available for future research.
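The abstract describes a retrieve-then-prompt setup: relevant demonstrations are fetched from a synthetic database and prepended to the target document. The sketch below illustrates one plausible form of that step; the embedding model, demonstration schema, and prompt wording are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of retrieval-based in-context learning over a
    # synthetic demonstration database (all names below are assumptions).
    from sentence_transformers import SentenceTransformer, util

    # Hypothetical demonstration database: documents paired with their
    # synthetic entity/relation annotations (schema is an assumption).
    demos = [
        {"text": "Marie Curie was born in Warsaw and won two Nobel Prizes.",
         "annotation": '{"entities": ["Marie Curie", "Warsaw"], '
                       '"relations": [["Marie Curie", "born_in", "Warsaw"]]}'},
        # ... in the paper, ~5k synthetic Wikipedia abstracts
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed retriever
    demo_embeddings = model.encode([d["text"] for d in demos],
                                   convert_to_tensor=True)

    def build_prompt(document: str, k: int = 3) -> str:
        """Retrieve the k most similar demonstrations and assemble a prompt."""
        query = model.encode(document, convert_to_tensor=True)
        scores = util.cos_sim(query, demo_embeddings)[0]
        top = scores.topk(min(k, len(demos))).indices.tolist()
        blocks = [
            f"Document:\n{demos[i]['text']}\nExtraction:\n{demos[i]['annotation']}"
            for i in top
        ]
        return (
            "Extract all entities and relation triples from the document.\n\n"
            + "\n\n".join(blocks)
            + f"\n\nDocument:\n{document}\nExtraction:"
        )

At inference time, the assembled prompt would be passed to the reasoning-optimized LLM, whose structured output is then parsed into entities and relation triples.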
Anthology ID:
2025.xllm-1.26
Volume:
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Hao Fei, Kewei Tu, Yuhui Zhang, Xiang Hu, Wenjuan Han, Zixia Jia, Zilong Zheng, Yixin Cao, Meishan Zhang, Wei Lu, N. Siddharth, Lilja Øvrelid, Nianwen Xue, Yue Zhang
Venues:
XLLM | WS
Publisher:
Association for Computational Linguistics
Pages:
298–309
URL:
https://preview.aclanthology.org/landing_page/2025.xllm-1.26/
Cite (ACL):
Nicholas Popovic, Ashish Kangen, Tim Schopf, and Michael Färber. 2025. DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations. In Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025), pages 298–309, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
DocIE@XLLM25: In-Context Learning for Information Extraction using Fully Synthetic Demonstrations (Popovic et al., XLLM 2025)
PDF:
https://preview.aclanthology.org/landing_page/2025.xllm-1.26.pdf