Abstract
This paper attempts a preliminary interpretation of the occurrence of different types of linguistic constructs in the manually annotated Polish Coreference Corpus by providing analyses of various statistical properties related to mentions, clusters and near-identity links. Among others, the frequencies of mentions, zero subjects and singleton clusters are presented, as well as the average mention and cluster size. We also show that some coreference clustering constraints, such as gender or number agreement, frequently do not hold in the case of Polish. The need for lemmatization for automatic coreference resolution is supported by an empirical study. Correlation between cluster and mention count within a text is investigated, with a short characterization of outlier cases. We also examine this correlation in each of the 14 text domains present in the corpus and show that none of them has an abnormal frequency of outlier texts regarding the cluster/mention ratio. Finally, we report on our negative experiences concerning the annotation of the near-identity relation. In the conclusion we put forward some guidelines for future research in the area.
- Anthology ID:
- L14-1066
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 3234–3238
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/1088_Paper.pdf
- Cite (ACL):
- Maciej Ogrodniczuk, Mateusz Kopeć, and Agata Savary. 2014. Polish Coreference Corpus in Numbers. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3234–3238, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- Polish Coreference Corpus in Numbers (Ogrodniczuk et al., LREC 2014)