A Bit of This, a Bit of That: Building a Genre and Topic Annotated Dataset of Historical Newspaper Articles with Soft Labels and Confidence Scores
Karin Stahel, Irenie How, Lauren Millar, Luis Paterson, Daniel Steel, Kaspar Middendorf
Abstract
Digitised historical newspaper collections are becoming increasingly accessible, yet their scale and diverse content still present challenges for researchers interested in specific article types or topics. In a step towards developing models to address these challenges, we have created a dataset of articles from New Zealand’s Papers Past open data annotated with multiple genre and topic labels and annotator confidence scores. Our annotation framework aligns with the perspectivist approach to machine learning, acknowledging the subjective nature of the task and embracing the hybridity and uncertainty of genres. In this paper, we describe our sampling and annotation methods and the resulting dataset of 7,036 articles from 106 New Zealand newspapers spanning the period 1839-1903. This dataset will be used to develop interpretable classification models that enable fine-grained exploration and discovery of articles in Papers Past newspapers based on common aspects of form, function, and topic. The complete dataset, including un-aggregated annotations and supporting documentation, will eventually be openly released to facilitate further research.
- Anthology ID:
- 2025.nlp4dh-1.33
- Volume:
- Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
- Month:
- May
- Year:
- 2025
- Address:
- Albuquerque, USA
- Editors:
- Mika Hämäläinen, Emily Öhman, Yuri Bizzoni, So Miyagawa, Khalid Alnajjar
- Venues:
- NLP4DH | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 377–392
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.33/
- Cite (ACL):
- Karin Stahel, Irenie How, Lauren Millar, Luis Paterson, Daniel Steel, and Kaspar Middendorf. 2025. A Bit of This, a Bit of That: Building a Genre and Topic Annotated Dataset of Historical Newspaper Articles with Soft Labels and Confidence Scores. In Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities, pages 377–392, Albuquerque, USA. Association for Computational Linguistics.
- Cite (Informal):
- A Bit of This, a Bit of That: Building a Genre and Topic Annotated Dataset of Historical Newspaper Articles with Soft Labels and Confidence Scores (Stahel et al., NLP4DH 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.nlp4dh-1.33.pdf