2016
NorGramBank: A ‘Deep’ Treebank for Norwegian
Helge Dyvik | Paul Meurer | Victoria Rosén | Koenraad De Smedt | Petter Haugereid | Gyri Smørdal Losnegaard | Gunn Inger Lyse | Martha Thunes
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
We present NorGramBank, a treebank for Norwegian with highly detailed LFG analyses. It is one of many treebanks made available through the INESS treebanking infrastructure. NorGramBank was constructed as a parsebank, i.e. by automatically parsing a corpus, using the wide-coverage grammar NorGram. One part consisting of 350,000 words has been manually disambiguated using computer-generated discriminants. A larger part of 50 million words has been stochastically disambiguated. The treebank is dynamic: by global reparsing at certain intervals it is kept compatible with the latest versions of the grammar and the lexicon, which are continually developed further in interaction with the annotators. A powerful query language, INESS Search, has been developed for search across formalisms in the INESS treebanks, including LFG c- and f-structures. Evaluation shows that the grammar provides about 85% of randomly selected sentences with good analyses. Agreement among the annotators responsible for manual disambiguation is satisfactory, but also suggests desirable simplifications of the grammar.
2014
The Interplay Between Lexical and Syntactic Resources in Incremental Parsebanking
Victoria Rosén | Petter Haugereid | Martha Thunes | Gyri S. Losnegaard | Helge Dyvik
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Automatic syntactic analysis of a corpus requires detailed lexical and morphological information that cannot always be harvested from traditional dictionaries. In building the INESS Norwegian treebank, it is often the case that necessary lexical information is missing in the morphology or lexicon. The approach used to build the treebank is incremental parsebanking; a corpus is parsed with an existing grammar, and the analyses are efficiently disambiguated by annotators. When the intended analysis is unavailable after parsing, the reason is often that necessary information is not available in the lexicon. INESS has therefore implemented a text preprocessing interface where annotators can enter unrecognized words before parsing. This may concern words that are unknown to the morphology and/or lexicon, and also words that are known, but for which important information is missing. When this information is added, either during text preprocessing or during disambiguation, the result is that after reparsing the intended analysis can be chosen and stored in the treebank. The lexical information added to the lexicon in this way may be of great interest both to lexicographers and to other language technology efforts, and the enriched lexical resource being developed will be made available at the end of the project.
2012
Extracting Semantic Transfer Rules from Parallel Corpora with SMT Phrase Aligners
Petter Haugereid | Francis Bond
Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation
2011
Extracting Transfer Rules for Multiword Expressions from Parallel Corpora
Petter Haugereid | Francis Bond
Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World
A grammar design accommodating packed argument frame information on verbs
Petter Haugereid
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation
2006
Functionality in grammar design
Anders Søgaard | Petter Haugereid
Proceedings of the 15th Nordic Conference of Computational Linguistics (NODALIDA 2005)