Kristina Vuckovic
Also published as: Kristina Vučković
2010
Improving Chunking Accuracy on Croatian Texts by Morphosyntactic Tagging
Kristina Vučković | Željko Agić | Marko Tadić
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper, we present the results of an experiment in which a stochastic morphosyntactic tagger serves as a pre-processing module for a rule-based chunker and partial parser for Croatian, with the aim of raising its overall chunking and partial parsing accuracy on Croatian texts. To conduct the experiment, we manually chunked and partially parsed 459 sentences from the Croatia Weekly 100 kw newspaper sub-corpus of the Croatian National Corpus; these sentences had previously been morphosyntactically disambiguated and lemmatized. Due to the lack of resources of this type, these sentences were designated as a temporary chunking and partial parsing gold standard for Croatian. We then evaluated the chunker and partial parser in three scenarios: (1) chunking text with no prior morphosyntactic tagging, (2) chunking text tagged by the stochastic morphosyntactic tagger for Croatian, and (3) chunking manually tagged text. The F1-scores obtained for the three scenarios were, respectively, 0.874 (P: 0.825, R: 0.930), 0.891 (P: 0.856, R: 0.928) and 0.914 (P: 0.904, R: 0.925). The paper describes the language resources and tools used in the experiment, its setup, the results, and perspectives for future work.
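The three F1-scores in this abstract can be checked against the reported precision and recall values. The following is a minimal sketch, assuming the standard balanced F1 (the harmonic mean of P and R); the function name `f1` and the scenario labels are illustrative and do not come from the paper.

```python
def f1(precision: float, recall: float) -> float:
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs as reported in the abstract; labels are illustrative.
scenarios = {
    "untagged input": (0.825, 0.930),
    "stochastic tagger": (0.856, 0.928),
    "manual tagging": (0.904, 0.925),
}

scores = {name: round(f1(p, r), 3) for name, (p, r) in scenarios.items()}

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

Rounded to three decimals, the computed values match the reported ones. They also show that the gain across scenarios comes from precision, which rises from 0.825 to 0.904 while recall stays close to 0.93.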
2008
Rule-Based Chunker for Croatian
Kristina Vučković | Marko Tadić | Zdravko Dovedan
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
In this paper we discuss a rule-based approach to chunking sentences in Croatian, implemented using local regular grammars within the NooJ development environment. We describe the rules and their implementation as regular grammars, and show that the NooJ environment makes it easy to fine-tune their individual sub-rules. Since Croatian has strong morphosyntactic features that are shared between most or all elements of a chunk, the rules are built to rely heavily on these features. To evaluate our chunker, we used a set of manually annotated sentences extracted from a 100 kw MSD-tagged and disambiguated Croatian corpus. Our chunker performed best on VP-chunks (F: 97.01), while NP-chunks (F: 92.31) and PP-chunks (F: 83.08) were of lower quality. The results are comparable to the chunker performance reported for the CoNLL-2000 shared task on chunking.
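The idea of relying on features shared across a chunk can be illustrated with a small sketch. This is not the authors' NooJ grammar: it is a hypothetical Python simplification in which an NP-chunk is a run of adjectives followed by a noun, all agreeing in case, number, and gender. The `Token` type, the tag values, and the function names are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    form: str
    pos: str     # hypothetical tags: "A" adjective, "N" noun, "V" verb
    case: str
    number: str
    gender: str


def agree(a: Token, b: Token) -> bool:
    """True when two tokens share case, number, and gender."""
    return (a.case, a.number, a.gender) == (b.case, b.number, b.gender)


def np_chunks(tokens: List[Token]) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, of agreeing A* N sequences."""
    chunks = []
    i = 0
    while i < len(tokens):
        start = i
        # Consume adjectives that agree with the first token of the candidate.
        while i < len(tokens) and tokens[i].pos == "A" and agree(tokens[start], tokens[i]):
            i += 1
        if i < len(tokens) and tokens[i].pos == "N" and agree(tokens[start], tokens[i]):
            chunks.append((start, i + 1))
            i += 1
        elif i == start:
            i += 1  # neither adjective nor noun: skip the token
    return chunks


# "Vidim veliku crvenu kuću" ("I see a big red house")
sentence = [
    Token("Vidim", "V", "-", "sg", "-"),
    Token("veliku", "A", "acc", "sg", "f"),
    Token("crvenu", "A", "acc", "sg", "f"),
    Token("kuću", "N", "acc", "sg", "f"),
]
print(np_chunks(sentence))
```

In this sketch an adjective that disagrees with the following noun is left outside the chunk, and the noun is chunked on its own. A rule of this kind depends on the input being morphosyntactically disambiguated.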