Márton Kardos
2026
Dynaword: From One-shot to Continuously Developed Datasets
Kenneth Enevoldsen | Kristian Nørgaard Jensen | Jan Kostkan | Balázs Szabó | Márton Kardos | Kirsten Vad | Johan Heinsen | Andrea Blasi Núñez | Gianluca Barmina | Jacob Nielsen | Rasmus Larsen | Rob van der Goot | Peter Vahlstrup | Per Møldrup Dalum | Desmond Elliott | Lukas Galke Poech | Peter Schneider-Kamp | Kristoffer Nielbo
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Large-scale datasets are foundational for research and development in natural language processing. However, current approaches face three key challenges: (1) reliance on ambiguously licensed sources restricting use, sharing, and derivative works; (2) static dataset releases that prevent community contributions and diminish longevity; and (3) quality assurance processes restricted to publishing teams rather than leveraging community expertise. To address these limitations, we introduce two contributions: the Dynaword approach and Danish Dynaword. The Dynaword approach is a framework for creating large-scale, open datasets that can be continuously updated through community collaboration. Danish Dynaword is a concrete implementation that validates this approach and demonstrates its potential. Danish Dynaword contains over five times as many tokens as comparable releases, is exclusively openly licensed, and has received multiple contributions across industry, the public sector and research institutions. The repository includes light-weight tests to ensure data formatting, quality, and documentation, establishing a sustainable framework for ongoing community contributions and dataset evolution.
2025
Modeling Multilayered Complexity in Literary Texts
Pascale Feldkamp | Márton Kardos | Kristoffer Nielbo | Yuri Bizzoni
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
We explore the relationship between stylistic and sentimental complexity in literary texts, analyzing how they interact and affect overall complexity. Using a dataset of over 9,000 English novels (19th-20th century), we find that complexity at the stylistic/syntactic level and complexity at the sentiment level tend to show a linear association. Finally, using dedicated datasets, we show that both stylistic/syntactic features (particularly those relating to information density) and sentiment features are related to text difficulty rank and to average processing time.
S3 - Semantic Signal Separation
Márton Kardos | Jan Kostkan | Kenneth Enevoldsen | Arnault-Quentin Vermillet | Kristoffer Nielbo | Roberta Rocca
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Topic models are useful tools for discovering latent semantic structures in large textual corpora. Recent efforts have been oriented at incorporating contextual representations in topic modeling and have been shown to outperform classical topic models. These approaches are typically slow, volatile, and require heavy preprocessing for optimal results. We present Semantic Signal Separation (S3), a theory-driven topic modeling approach in neural embedding spaces. S3 conceptualizes topics as independent axes of semantic space and uncovers these by decomposing contextualized document embeddings using Independent Component Analysis. Our approach provides diverse and highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextual topic model, being, on average, 4.5x faster than the runner-up BERTopic. We offer an implementation of S3, and all contextual baselines, in the Turftopic Python package.
topicwizard - a Modern, Model-agnostic Framework for Topic Model Visualization and Interpretation
Márton Kardos | Kenneth Enevoldsen | Kristoffer Nielbo
Proceedings of the 8th International Conference on Natural Language and Speech Processing (ICNLSP-2025)
2024
Canonical Status and Literary Influence: A Comparative Study of Danish Novels from the Modern Breakthrough (1870–1900)
Pascale Feldkamp | Alie Lassche | Jan Kostkan | Márton Kardos | Kenneth Enevoldsen | Katrine Baunvig | Kristoffer Nielbo
Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
We examine the relationship between the canonization of Danish novels and their textual innovation and influence, taking the Danish Modern Breakthrough era (1870–1900) as a case study. We evaluate whether canonical novels introduced a significant textual novelty in their time, and explore their influence on the overall literary trend of the period. By analyzing the positions of canonical versus non-canonical novels in semantic space, we seek to better understand the link between a novel’s canonical status and its literary impact. Additionally, we examine the overall diversification of Modern Breakthrough novels during this significant period of rising literary readership. We find that canonical novels stand out from both the historical novel genre and non-canonical novels of the period. Our findings on diversification within and across groups indicate that the novels now regarded as canonical served as literary trendsetters of their time.
2023
OdyCy – A general-purpose NLP pipeline for Ancient Greek
Jan Kostkan | Márton Kardos | Jacob Palle Bliddal Mortensen | Kristoffer Laigaard Nielbo
Proceedings of the 7th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
This paper presents a general-purpose NLP pipeline that achieves state-of-the-art performance on the Ancient Greek Perseus UD Treebank for several tasks (POS Tagging, Morphological Analysis and Dependency Parsing), and close to state-of-the-art performance on the Proiel UD Treebank. Our aim is to provide a reproducible, open source language processing pipeline for Ancient Greek, capable of handling input texts of varying quality. We measure the performance of our model against other comparable tools and then evaluate lemmatization errors.
Co-authors
- Kristoffer Nielbo 5
- Kenneth Enevoldsen 4
- Jan Kostkan 4
- Pascale Feldkamp 2
- Gianluca Barmina 1
- Katrine Baunvig 1
- Yuri Bizzoni 1
- Per Møldrup Dalum 1
- Desmond Elliott 1
- Lukas Galke Poech 1
- Rob van der Goot 1
- Johan Heinsen 1
- Kristian Nørgaard Jensen 1
- Rasmus Larsen 1
- Alie Lassche 1
- Jacob Palle Bliddal Mortensen 1
- Kristoffer Laigaard Nielbo 1
- Jacob Nielsen 1
- Andrea Blasi Núñez 1
- Roberta Rocca 1
- Peter Schneider-Kamp 1
- Balázs Szabó 1
- Kirsten Vad 1
- Peter Vahlstrup 1
- Arnault-Quentin Vermillet 1