Syntactic units and their length distributions: A case study in Czech
Michaela Nogolová, Michaela Koščová, Jan Macutek, Radek Cech
Abstract
This study investigates the length distributions of syntactic units in Czech across multiple hierarchical levels: sentences, independent clauses, clauses, phrases, subphrases, and chunks. Using a diverse dataset – including Universal Dependencies treebanks, presidential speeches, the Czech Bible, and a random sample from corpora of modern Czech – the analysis examines whether the lengths of these syntactic units follow consistent distributional patterns. Length is defined as the number of immediate subunits, and the distributions were modeled using the hyper-Poisson distribution. The results demonstrate that the hyper-Poisson model fits the length distributions of all of the above-mentioned syntactic units well, pointing to a common principle underlying the organization of syntactic structure in Czech.
- Anthology ID:
- 2025.quasy-1.14
- Volume:
- Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025)
- Month:
- August
- Year:
- 2025
- Address:
- Ljubljana, Slovenia
- Editors:
- Xinying Chen, Yaqin Wang
- Venues:
- Quasy | WS | SyntaxFest
- SIG:
- SIGPARSE
- Publisher:
- Association for Computational Linguistics
- Pages:
- 115–123
- URL:
- https://preview.aclanthology.org/mtsummit-25-ingestion/2025.quasy-1.14/
- Cite (ACL):
- Michaela Nogolová, Michaela Koščová, Jan Macutek, and Radek Cech. 2025. Syntactic units and their length distributions: A case study in Czech. In Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025), pages 115–123, Ljubljana, Slovenia. Association for Computational Linguistics.
- Cite (Informal):
- Syntactic units and their length distributions: A case study in Czech (Nogolová et al., Quasy-SyntaxFest 2025)
- PDF:
- https://preview.aclanthology.org/mtsummit-25-ingestion/2025.quasy-1.14.pdf
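
The abstract names the hyper-Poisson distribution as the model for unit lengths but does not spell out its form. As a rough illustration only, the sketch below uses the standard two-parameter definition, P(X = x) = a^x / ((b)_x · 1F1(1; b; a)), where (b)_x = b(b+1)…(b+x−1) is the rising factorial and 1F1 is the confluent hypergeometric function. The function name `hyper_poisson_pmf`, the `shift` argument (an assumed displacement so that the smallest length can be 1 rather than 0), and the parameter values in the usage line are hypothetical; they are not taken from the paper, and the authors' exact parametrization and fitting procedure may differ.

```python
import math


def hyper_poisson_pmf(a, b, max_x, shift=0, tol=1e-15):
    """Return [P(X = shift), ..., P(X = max_x)] for a hyper-Poisson variable.

    Unnormalized terms are t_k = a**k / (b * (b + 1) * ... * (b + k - 1)),
    and the normalizing constant is their infinite sum, 1F1(1; b; a).
    `shift` is an assumption for illustration: shift=1 gives a 1-displaced
    variant in which the smallest possible length is 1.
    """
    if a <= 0 or b <= 0:
        raise ValueError("a and b must be positive")
    n_wanted = max_x - shift + 1
    if n_wanted < 1:
        raise ValueError("max_x must be at least shift")

    terms = []
    total = 0.0  # running partial sum of the series for 1F1(1; b; a)
    t = 1.0      # t_0 = 1
    k = 0
    # Keep summing until all requested terms are stored and the next
    # term is negligible relative to the partial sum.
    while k < n_wanted or t > tol * total:
        if k < n_wanted:
            terms.append(t)
        total += t
        t *= a / (b + k)  # t_{k+1} = t_k * a / (b + k)
        k += 1
    return [term / total for term in terms]


# Illustrative call with made-up parameters (not fitted values from the paper).
probs = hyper_poisson_pmf(a=1.2, b=2.0, max_x=10, shift=1)
```

With b = 1 the rising factorial reduces to x!, so the model collapses to the ordinary Poisson distribution with mean a, which gives a quick sanity check on the implementation.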