FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition

Jonas Golde, Patrick Haller, Alan Akbik


Abstract
Recent multilingual named entity recognition (NER) work has shown that large language models (LLMs) can provide effective synthetic supervision, yet such datasets have mostly appeared as by-products of broader experiments rather than as systematic, reusable resources. We introduce FiNERweb, a dataset-creation pipeline that scales the teacher-student paradigm to 91 languages and 25 scripts. Building on FineWeb-Edu, our approach trains regression models to identify NER-relevant passages and annotates them with multilingual LLMs, resulting in about 225k passages with 235k distinct entity labels. Our experiments show that the regression model achieves more than 84 F1, and that models trained on FiNERweb obtain comparable or improved performance in zero-shot transfer settings on English, Thai, and Swahili, despite being trained on 19x less data than strong baselines. In addition, we assess annotation quality using LLM-as-a-judge and observe consistently high scores for both faithfulness (3.99/5) and completeness (4.05/5), indicating reliable and informative annotations. Further, we release the dataset with both English labels and translated label sets in the respective target languages because we observe that the performance of current state-of-the-art models drops by 0.02-0.09 F1 when evaluated using target language labels instead of English ones. We release FiNERweb together with all accompanying artifacts to the research community in order to facilitate more effective student-teacher training for multilingual named entity recognition.
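The passage-selection step described in the abstract (score each passage for NER relevance, keep the useful ones, then hand them to an LLM annotator) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `Passage` type, the `select_passages` helper, the threshold value, and especially `toy_score`, which is a crude capitalization heuristic standing in for the trained regression model and is not the authors' method.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Passage:
    """Hypothetical container for one web passage."""
    text: str
    language: str


def select_passages(
    passages: Iterable[Passage],
    score_fn: Callable[[str], float],
    threshold: float,
) -> List[Passage]:
    """Keep passages whose relevance score is at least `threshold`."""
    return [p for p in passages if score_fn(p.text) >= threshold]


def toy_score(text: str) -> float:
    """Stand-in scorer (assumption): fraction of capitalized tokens.

    A real pipeline would call a trained regression model here.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(t[0].isupper() for t in tokens) / len(tokens)


passages = [
    Passage("Alan Akbik works in Berlin", "en"),
    Passage("the weather is nice today", "en"),
]

# The threshold of 0.3 is arbitrary and chosen only for this toy example.
kept = select_passages(passages, toy_score, threshold=0.3)
```

Only the retained passages would then be passed to the annotation step, which keeps the cost of LLM labeling proportional to the amount of entity-rich text rather than to the size of the crawl.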
Anthology ID:
2026.findings-eacl.121
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
2281–2300
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.121/
Cite (ACL):
Jonas Golde, Patrick Haller, and Alan Akbik. 2026. FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition. In Findings of the Association for Computational Linguistics: EACL 2026, pages 2281–2300, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
FiNERweb: Datasets and Artifacts for Scalable Multilingual Named Entity Recognition (Golde et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.121.pdf
Checklist:
 2026.findings-eacl.121.checklist.pdf