Syntactic units and their length distributions: A case study in Czech

Michaela Nogolová, Michaela Koščová, Jan Macutek, Radek Cech


Abstract
This study investigates the length distributions of syntactic units in Czech across multiple hierarchical levels: sentences, independent clauses, clauses, phrases, subphrases, and chunks. Using a diverse dataset – including Universal Dependency treebanks, presidential speeches, the Czech Bible, and a random sample from corpora of modern Czech – the analysis examines whether the lengths of these syntactic units follow consistent distributional patterns. Length is defined as the number of immediate subunits, and the distributions were modeled using the hyper-Poisson distribution. The results demonstrate that the hyper-Poisson model fits the length distributions of all of the above-mentioned syntactic units well, pointing to a common principle underlying the organization of syntactic structure in Czech.
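For readers unfamiliar with the hyper-Poisson distribution named in the abstract, the following is a minimal sketch of its probability mass function in the standard two-parameter form, P(x) = a^x / (b^(x) · 1F1(1; b; a)) for x = 0, 1, 2, ..., where b^(x) = b(b+1)...(b+x-1) is the rising factorial. The function name `hyper_poisson_pmf`, the parameter names `a` and `b`, and the truncation length `tail_terms` are assumptions made for illustration; the abstract does not state which parametrization or fitting procedure the authors used. Because a unit's length is at least 1, studies of this kind often shift the support by one (a displaced variant), which is likewise an assumption here.

```python
def hyper_poisson_pmf(a, b, max_x, tail_terms=200):
    """Return [P(0), ..., P(max_x)] for a hyper-Poisson distribution.

    P(x) = a**x / (b^(x) * 1F1(1; b; a)), where b^(x) is the rising
    factorial. The normalizing constant 1F1(1; b; a) equals the sum of
    a**j / b^(j) over j >= 0; here it is approximated by truncating the
    series after max_x + tail_terms + 1 terms (an illustrative choice
    that is adequate for moderate values of a).
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    terms = []
    term = 1.0  # j = 0: a**0 / b^(0) = 1
    for j in range(max_x + tail_terms + 1):
        terms.append(term)
        term *= a / (b + j)  # step from a**j / b^(j) to a**(j+1) / b^(j+1)
    norm = sum(terms)  # truncated approximation of 1F1(1; b; a)
    return [t / norm for t in terms[: max_x + 1]]


def displaced_pmf(a, b, max_len):
    """Hypothetical 1-displaced variant: maps length k >= 1 to P(k - 1)."""
    probs = hyper_poisson_pmf(a, b, max_len - 1)
    return {k: probs[k - 1] for k in range(1, max_len + 1)}
```

With b = 1 the rising factorial reduces to x!, so the sketch reproduces the ordinary Poisson distribution; this gives a quick sanity check on any implementation.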
Anthology ID:
2025.quasy-1.14
Volume:
Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025)
Month:
August
Year:
2025
Address:
Ljubljana, Slovenia
Editors:
Xinying Chen, Yaqin Wang
Venues:
Quasy | WS | SyntaxFest
SIG:
SIGPARSE
Publisher:
Association for Computational Linguistics
Pages:
115–123
URL:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.quasy-1.14/
Cite (ACL):
Michaela Nogolová, Michaela Koščová, Jan Macutek, and Radek Cech. 2025. Syntactic units and their length distributions: A case study in Czech. In Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025), pages 115–123, Ljubljana, Slovenia. Association for Computational Linguistics.
Cite (Informal):
Syntactic units and their length distributions: A case study in Czech (Nogolová et al., Quasy-SyntaxFest 2025)
PDF:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.quasy-1.14.pdf