SEEM-CZ: Annotation and Classification of Epistemic Markers in Czech
Barbora Štěpánková, Michal Novák, Tomáš Musil, Lucie Polakova
Abstract
We present a project focused on the linguistic description, annotation and automatic classification of so-called epistemic markers in Czech. These expressions, such as pravděpodobně ‘probably’, zřejmě ‘apparently’ and určitě ‘certainly’, typically operate within the pragmatic domain of language. We introduce a dataset containing manual annotations of the 40 most frequent epistemic markers in Czech, totalling almost 4,000 uses. The annotation was created using parallel InterCorp data (in Czech and English) and the TEITOK tool. We describe the annotation scheme, the annotation process and data handling. The dataset forms the core of the emerging lexical database of these expressions (SEEMLex). Thanks to the comprehensive manual annotation, the dataset can also serve as a source of further pragmatic information and as a basis for further linguistic research. The proposed annotation scheme can also be used for other languages. To demonstrate the dataset’s utility for automatic classification, we trained XLM-RoBERTa classifiers using 10-fold cross-validation, achieving 72.6% accuracy for type-of-use classification (6 classes) and 54.2% accuracy for degree-of-certainty classification (4 classes).
- Anthology ID:
- 2026.lrec-main.545
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 6856–6869
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.545/
- Cite (ACL):
- Barbora Štěpánková, Michal Novák, Tomáš Musil, and Lucie Polakova. 2026. SEEM-CZ: Annotation and Classification of Epistemic Markers in Czech. International Conference on Language Resources and Evaluation, main:6856–6869.
- Cite (Informal):
- SEEM-CZ: Annotation and Classification of Epistemic Markers in Czech (Štěpánková et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.545.pdf