Tanja Purtonen

2012

Rule-Based Detection of Clausal Coordinate Ellipsis
Kristiina Muhonen | Tanja Purtonen
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this experiment, we show how clausal coordinate ellipsis can be detected and annotated with Constraint Grammar rules. We focus on an elliptical structure in which two clauses are coordinated and the latter lacks a verb. For example, the sentence "This belongs to me and that to you" demonstrates the ellipsis in question, namely gapping. The Constraint Grammar rules are written for a Finnish parsebank, FinnTreeBank. The FinnTreeBank project is building a parsebank in the dependency syntactic framework, in which verbs are central because other sentence elements depend on them. Without correct detection of omitted verbs, the syntactic analysis of the whole sentence fails. In the experiment, we detect gapping based on morphology and the linear order of the words, without using syntactic or semantic information. The test corpus, Finnish Wikipedia, is morphologically analyzed but not disambiguated. Even with an ambiguous morphological analysis, the results show that 89.9% of the detected sentences are elliptical, making the rules accurate enough to be used in the creation of FinnTreeBank. Once we have a morphologically disambiguated corpus, we can write more accurate rules and expect better results.

Specifying Treebanks, Outsourcing Parsebanks: FinnTreeBank 3
Atro Voutilainen | Kristiina Muhonen | Tanja Purtonen | Krister Lindén
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Corpus-based treebank annotation is known to result in incomplete coverage of mid- and low-frequency linguistic constructions: the linguistic representation and corpus annotation quality are sometimes suboptimal. Large descriptive grammars also cover many mid- and low-frequency constructions. We argue for the use of large descriptive grammars and their sample sentences as a basis for specifying higher-coverage grammatical representations. We present a sample case from an ongoing project (FIN-CLARIN FinnTreeBank) where a grammatical representation is documented as an annotator's manual alongside manual annotation of sample sentences extracted from a large descriptive grammar of Finnish. We outline the linguistic representation (morphology and dependency syntax) for Finnish, and show how the resulting 'Grammar Definition Corpus' and the documentation are used as a task specification for an external subcontractor building a parser engine for morphological and dependency syntactic analysis of large volumes of Finnish for parsebanking purposes. The resulting corpus, FinnTreeBank 3, is due for release in June 2012, and will contain tens of millions of words from publicly available corpora of Finnish with automatic morphological and dependency syntactic analysis, for use in research on corpus linguistics and language engineering.

2011

A double-blind experiment on interannotator agreement: the case of dependency syntax and Finnish
Atro Voutilainen | Tanja Purtonen
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)

Co-authors

Kaarlo Voionmaa 1

Krister Lindén 1
