SEEM-CZ: Annotation and Classification of Epistemic Markers in Czech

Barbora Štěpánková, Michal Novák, Tomáš Musil, Lucie Polakova


Abstract
We present a project focused on the linguistic description, annotation and automatic classification of so-called epistemic markers in Czech. These expressions, such as pravděpodobně ‘probably’, zřejmě ‘apparently’ and určitě ‘certainly’, typically operate within the pragmatic domain of language. We introduce a dataset containing manual annotations of the 40 most frequent epistemic markers in Czech, totalling almost 4,000 uses. The annotation was created using parallel InterCorp data (in Czech and English) and the TEITOK tool. We describe the annotation scheme, the annotation process and data handling. The dataset forms the core of the emerging lexical database of these expressions (SEEMLex). Thanks to the comprehensive manual annotation, the dataset can also serve as a source of further pragmatic information and as a basis for further linguistic research. The proposed annotation scheme can also be used for other languages. To demonstrate the dataset’s utility for automatic classification, we trained XLM-RoBERTa classifiers using 10-fold cross-validation, achieving 72.6% accuracy for type-of-use classification (6 classes) and 54.2% accuracy for degree-of-certainty classification (4 classes).
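The 10-fold cross-validation protocol mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `k_fold_indices` and `cross_validate` are hypothetical, the labels are toy values, and a majority-class predictor stands in for the XLM-RoBERTa classifier so that the example stays self-contained. Whether the original experiments used shuffled or stratified folds is not stated in the abstract; plain shuffled folds are assumed here.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, k=10, seed=0):
    """Return the mean held-out accuracy of a majority-class placeholder.

    Assumption: a real experiment would fine-tune a classifier on the
    training folds instead of counting labels.
    """
    folds = k_fold_indices(len(labels), k, seed)
    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held_out_set]
        # Placeholder "model": always predict the most frequent training label.
        majority = Counter(train).most_common(1)[0][0]
        correct = sum(1 for i in held_out if labels[i] == majority)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


if __name__ == "__main__":
    toy_labels = ["a"] * 30 + ["b"] * 10  # toy data, not SEEM-CZ labels
    print(f"mean accuracy: {cross_validate(toy_labels):.3f}")
```

Each item is held out exactly once, so the reported figure is the average accuracy over ten disjoint test folds rather than a score on a single split.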
Anthology ID:
2026.lrec-main.545
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
6856–6869
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.545/
Cite (ACL):
Barbora Štěpánková, Michal Novák, Tomáš Musil, and Lucie Polakova. 2026. SEEM-CZ: Annotation and Classification of Epistemic Markers in Czech. International Conference on Language Resources and Evaluation, main:6856–6869.
Cite (Informal):
SEEM-CZ: Annotation and Classification of Epistemic Markers in Czech (Štěpánková et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.545.pdf