LocalGovPL: A Corpus of Speaker-Attributed Polish Local Government Transcripts

Dariusz Czerski, Maciej Ogrodniczuk


Abstract
We present LocalGovPL, a large-scale, speaker-annotated corpus of Polish local government meeting transcripts processed with an automatic two-stage LLM pipeline. The corpus consists of 31,900 sessions from 749 councils recorded between 2018 and 2025 (approximately 391M words). It is released in TEI P5 format with explicit links between utterances and registered participants. We collect transcripts from official local government portals using a dedicated crawler, normalize the text, and apply: (1) LLM-assisted extraction of person names and administrative roles; and (2) attribution of utterances to identified speakers using discourse cues. To assess attribution quality, we manually annotate 30 sessions and compare five LLM configurations under three evaluation protocols, measured by speaker-aware word error rate (sWER). The strongest system, Gemini-2.5-pro, achieves 3.9% sWER for abstract speaker identification, 4.6% for known participants, and 5.9% for end-to-end processing with relaxed name matching. LocalGovPL enables large-scale analysis of local deliberative discourse and supports research on dialogue modeling, summarization, and political text analysis.
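The abstract does not define sWER, so the following is only an illustrative sketch of how a speaker-aware word error rate could be computed, not the authors' implementation. It assumes that every word is paired with the speaker of its utterance, that a token counts as correct only when both the word and the speaker match, and that the score is the token-level edit distance divided by the number of reference tokens. The names `expand`, `edit_distance`, and `swer` are hypothetical.

```python
from typing import List, Tuple

Utterance = Tuple[str, str]  # (speaker_id, utterance_text)
Token = Tuple[str, str]      # (speaker_id, word)


def expand(utterances: List[Utterance]) -> List[Token]:
    """Flatten utterances into one token per word, tagged with its speaker."""
    return [(speaker, word) for speaker, text in utterances for word in text.split()]


def edit_distance(ref: List[Token], hyp: List[Token]) -> int:
    """Levenshtein distance over (speaker, word) tokens, row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1  # mismatch if the word OR the speaker differs
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def swer(reference: List[Utterance], hypothesis: List[Utterance]) -> float:
    """Assumed definition: token edit distance / number of reference tokens."""
    ref, hyp = expand(reference), expand(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# Toy example: the word "thank" is attributed to the wrong speaker.
reference = [("chair", "I open the session"), ("mayor", "thank you")]
hypothesis = [("chair", "I open the session thank"), ("mayor", "you")]
print(round(swer(reference, hypothesis), 3))  # 0.167
```

Under these assumptions, a misplaced utterance boundary is penalized in proportion to the number of words assigned to the wrong speaker, which is why a word-level rather than utterance-level score is used in the sketch.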
Anthology ID:
2026.lrec-main.626
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
7883–7893
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.626/
Cite (ACL):
Dariusz Czerski and Maciej Ogrodniczuk. 2026. LocalGovPL: A Corpus of Speaker-Attributed Polish Local Government Transcripts. International Conference on Language Resources and Evaluation, main:7883–7893.
Cite (Informal):
LocalGovPL: A Corpus of Speaker-Attributed Polish Local Government Transcripts (Czerski & Ogrodniczuk, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.626.pdf