Alireza Razzaghi
2026
ParsCORE: The Persian Corpus of Online Registers
Alireza Razzaghi | Erik Henriksson | Veronika Laippala
The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family
Despite recent advances in automatic web register (genre) labeling and its applications to web-scale datasets and LLM development, the effectiveness of these tools for digitally low-resource languages remains unclear. This study introduces ParsCORE, the first large-scale collection of Persian web registers (genres), and evaluates deep learning models for register classification and keyword analysis across major registers. Using 2,000 human-annotated documents, the models achieved a micro F1-score of 0.76. The findings provide a foundation for future research on the linguistic and cultural specificities of Persian registers.
Register Mixing Is the Norm on the Web
Erik Henriksson | Alireza Razzaghi | Tuomas Lundberg | Antti Kanner | Veronika Laippala
Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities
Nearly all studies on web registers (online text varieties associated with characteristic social contexts and linguistic features) use full documents as the unit of analysis. However, web documents often contain sections in different registers. A cooking blog, for instance, may combine personal storytelling, recipe instructions, user comments, and promotional text within a single URL. This internal variation raises doubts about the validity of document-level register labeling. In this paper, we propose an LLM-based approach that identifies register-homogeneous segments within documents and apply it to a 10,000-document English sample from HPLT 3.0. We show that segmentation addresses persistent problems in register analysis, including low inter-annotator agreement and category fuzziness. Strikingly, it also reveals that most web documents contain more than one register, making register mixing the norm rather than the exception on the web.