2024
Man or Machine: Evaluating Spelling Error Detection in Danish Newspaper Corpora
Eckhard Bick | Jonas Nygaard Blom | Marianne Rathje | Jørgen Schack
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
This paper evaluates frequency and detection performance for both spelling and grammatical errors in a corpus of published Danish newspaper texts, comparing the results of three human proofreaders with those of an automatic system, DanProof. Adopting the error categorization scheme of the latter, we look at the accuracy of individual error types and their relative distribution over time, as well as the adequacy of suggested corrections. Finally, we discuss so-called artefact errors introduced by corpus processing, and the potential of DanProof as a corpus cleaning tool for identifying and correcting format conversion, OCR or other compilation errors. In the evaluation, with balanced F1-scores of 77.6 and 67.6 for 1999 texts and 2019 texts, respectively, DanProof achieved a higher recall and accuracy than the individual human annotators, and contributed the largest share of errors not detected by others (16.4% for 1999 and 23.6% for 2019). However, the human annotators had a significantly higher precision. Not counting artefacts, the overall error frequency in the corpus was low (0.5%), and in the newer texts it was less than half of that in the older ones, a change that mostly concerned orthographical errors, with a correspondingly higher relative share of grammatical errors.
2023
Linking Danish Parser Output to a Central Word Repository - From Morphosemantic Disambiguation to Unique Identifiers
Eckhard Bick
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)
Proceedings of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications
Eckhard Bick | Trond Trosterud | Tanel Alumäe
Proceedings of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications
Attribution of Quoted Speech in Portuguese Text
Eckhard Bick
Proceedings of the NoDaLiDa 2023 Workshop on Constraint Grammar - Methods, Tools and Applications
This paper describes and evaluates a rule-based system implementing a novel method for quote attribution in Portuguese text, working on top of a Constraint-Grammar parse. Both direct and indirect speech are covered, as well as certain other text-embedded quote sources. In a first step, the system performs quote segmentation and identifies speech verbs, taking into account the different styles used in literature and news text. Speakers are then identified using syntactically and semantically grounded Constraint-Grammar rules. We rely on relational links and stream variables to handle anaphoric mentions and to recover the names of implied or underspecified speakers. In an evaluation including both literature and news text, the system performed well on both the segmentation and attribution tasks, achieving F-scores of 98-99% for the former and 89-94% for the latter.
2022
Lemma Hunting: Automatic Spelling Normalization for German CMC Corpora
Eckhard Bick
Proceedings of the 18th Conference on Natural Language Processing (KONVENS 2022)
A Framenet and Frame Annotator for German Social Media
Eckhard Bick
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper presents PFN-DE, a new, parsing- and annotation-oriented framenet for German, with almost 15,000 frames, covering 11,300 verb lemmas. The resource was developed in the context of a Danish/German social-media study on hate speech and has a strong focus on coverage, robustness and cross-language comparability. A simple annotation scheme for argument roles meshes directly with the output of a syntactic parser, facilitating frame disambiguation through slot-filler conditions based on valency, syntactic function and semantic noun class. We discuss design principles for the framenet and the frame tagger using it, and present statistics for frame and role distribution at both the lexicon (type) and corpus (token) levels. In an evaluation run on Twitter data, the parser-based frame annotator achieved an overall F-score for frame senses of 93.6%.
2020
Syntax and Semantics in a Treebank for Esperanto
Eckhard Bick
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper we describe and evaluate syntactic and semantic aspects of Arbobanko, a treebank for the artificial language Esperanto, as well as tools and methods used in the production of the treebank. In addition to classical morphosyntax and dependency structure, the treebank was enriched with a lexical-semantic layer covering named entities, a semantic type ontology for nouns and adjectives and a framenet-inspired semantic classification of verbs. For an under-resourced language, the quality of automatic syntactic and semantic pre-annotation is of obvious importance, and by evaluating the underlying parser and the coverage of its semantic ontologies, we try to answer the question whether the language’s extremely regular morphology and transparent semantic affixes translate into a more regular syntax and higher parsing accuracy. On the linguistic side, the treebank allows us to address and quantify typological issues such as the question of word order, auxiliary constructions, lexical transparency and semantic type ambiguity in Esperanto.
An Annotated Social Media Corpus for German
Eckhard Bick
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper presents the German Twitter section of a large (2 billion word) bilingual Social Media corpus for Hate Speech research, discussing the compilation, pseudonymization and grammatical annotation of the corpus, as well as special linguistic features and peculiarities encountered in the data. Among other things, compounding, accidental and intentional orthographic variation, gendering and the use of emoticons/emojis are addressed in a genre-specific fashion. We present the different layers of linguistic annotation (morphosyntactic, dependencies and semantic types) and explain how a general parser (GerGram) can be made to work on Social Media data, pointing out necessary adaptations and extensions. In an evaluation run on a random cross-section of tweets, the modified parser achieved F-scores of 97% for morphology (fine-grained POS) and 92% for syntax (labeled attachment score). Predictably, performance was twice as good in tweets with standard orthography as in tweets with spelling/casing irregularities or lack of sentence separation, the effect being more marked for morphology than for syntax.
2019
A Semantic Ontology of Danish Adjectives
Eckhard Bick
Proceedings of the 13th International Conference on Computational Semantics - Long Papers
This paper presents a semantic annotation scheme for Danish adjectives, focusing both on prototypical semantic content and semantic collocational restrictions on an adjective’s head noun. The core type set comprises about 110 categories ordered in a shallow hierarchy with 14 primary and 25 secondary umbrella categories. In addition, domain information and binary sentiment tags are provided, as well as VerbNet-derived frames and semantic roles for those adjectives governing arguments. The scheme has been almost fully implemented on the lexicon of the Danish VISL parser, DanGram, containing 14,000 adjectives. We discuss the annotation scheme and its applicational perspectives, and present a statistical breakdown and coverage evaluation for three Danish reference corpora.
Automatic Generation and Semantic Grading of Esperanto Sentences in a Teaching Context
Eckhard Bick
Proceedings of the 8th Workshop on NLP for Computer Assisted Language Learning
2017
From Treebank to Propbank: A Semantic-Role and VerbNet Corpus for Danish
Eckhard Bick
Proceedings of the 21st Nordic Conference on Computational Linguistics
Universal Dependencies for Portuguese
Alexandre Rademaker | Fabricio Chalub | Livy Real | Cláudia Freitas | Eckhard Bick | Valeria de Paiva
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)
Propbank Annotation of Danish Noun Frames
Eckhard Bick
Proceedings of the 12th International Conference on Computational Semantics (IWCS) — Short papers
2016
A Morphological Lexicon of Esperanto with Morpheme Frequencies
Eckhard Bick
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper discusses the internal structure of complex Esperanto words (CWs). Using a morphological analyzer, possible affixation and compounding were checked for over 50,000 Esperanto lexemes against a list of 17,000 root words. Morpheme boundaries in the resulting analyses were then checked manually, creating a CW dictionary of 28,000 words, representing 56.4% of the lexicon, or 19.4% of corpus tokens. The error percentage of the EspGram morphological analyzer for new corpus CWs was 4.3% for types and 6.4% for tokens, with a recall of almost 100%, and wrong/spurious boundaries being more common than missing ones. For pedagogical purposes a morpheme frequency dictionary was constructed for a 16 million word corpus, confirming the importance of agglutinative derivational morphemes in the Esperanto lexicon. Finally, as a means to reduce the morphological ambiguity of CWs, we provide POS likelihoods for Esperanto suffixes.
Constraint Grammar-based conversion of Dependency Treebanks
Eckhard Bick
Proceedings of the 13th International Conference on Natural Language Processing
2015
DanProof: Pedagogical Spell and Grammar Checking for Danish
Eckhard Bick
Proceedings of the International Conference Recent Advances in Natural Language Processing
CG-3 — Beyond Classical Constraint Grammar
Eckhard Bick | Tino Didriksen
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)
WikiTrans: Swedish-Danish Machine Translation in a Constraint Grammar Framework
Eckhard Bick
Proceedings of the Joint Workshop on Language Technology for Closely Related Languages, Varieties and Dialects
2014
ML-Optimization of Ported Constraint Grammars
Eckhard Bick
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we describe how a Constraint Grammar with linguist-written rules can be optimized and ported to another language using a Machine Learning technique. The effects of rule movements, sorting, grammar-sectioning and systematic rule modifications are discussed and quantitatively evaluated. Statistical information is used to provide a baseline and to enhance the core of manual rules. The best-performing parameter combinations achieved part-of-speech F-scores of over 92 for a grammar ported from English to Danish, a considerable advance over both the statistical baseline (85.7), and the raw ported grammar (86.1). When the same technique was applied to an existing native Danish CG, error reduction was 10% (F=96.94).
2013
Using Constraint Grammar for Chunking
Eckhard Bick
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)
ML-Tuned Constraint Grammars
Eckhard Bick
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)
2012
Tailored Feature Extraction for Lexical Disambiguation of English Verbs Based on Corpus Pattern Analysis
Martin Holub | Vincent Kríž | Silvie Cinková | Eckhard Bick
Proceedings of COLING 2012
The annotation of the C-ORAL-BRASIL oral through the implementation of the Palavras Parser
Eckhard Bick | Heliana Mello | Alessandro Panunzi | Tommaso Raso
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This article describes the morphosyntactic annotation of the C-ORAL-BRASIL speech corpus, using an adapted version of the Palavras parser. In order to achieve compatibility with annotation rules designed for standard written Portuguese, transcribed words were orthographically normalized, and the parsing lexicon augmented with speech-specific material, phonetically spelled abbreviations etc. Using a two-level annotation approach, speech flow markers like overlaps, retractions and non-verbal productions were separated from running, annotatable text. In the absence of punctuation, syntactic segmentation was achieved by exploiting prosodic break markers, enhanced by a rule-based distinction between pause and break functions. Under optimal conditions, the modified parsing system achieved correctness rates (F-scores) of 98.6% for part of speech, 95% for syntactic function and 99% for lemmatization. Especially at the syntactic level, a clear connection between accessibility of prosodic break markers and annotation performance could be documented.
Towards a Semantic Annotation of English Television News - Building and Evaluating a Constraint Grammar FrameNet
Eckhard Bick
Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation
2011
A FrameNet for Danish
Eckhard Bick
Proceedings of the 18th Nordic Conference of Computational Linguistics (NODALIDA 2011)
A Bare-bones Constraint Grammar
Eckhard Bick
Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation
2010
FrAG, a Hybrid Constraint Grammar Parser for French
Eckhard Bick
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper describes a hybrid system (FrAG) for tagging/parsing French text, and presents results from ongoing development work, corpus annotation and evaluation. The core of the system is a sentence-scope Constraint Grammar (CG), with linguist-written rules. However, unlike traditional CG, the system uses hybrid techniques on both its morphological input side and its syntactic output side. Thus, FrAG draws on a pre-existing probabilistic Decision Tree Tagger (DTT) before and in parallel with its own lexical stage, and feeds its output into a Phrase Structure Grammar (PSG) that uses CG syntactic function tags rather than ordinary terminals in its rewriting rules. As an alternative architecture, dependency tree structures are also supported. In the newest version, dependencies are assigned within the CG framework itself, and can interact with other rules. To provide semantic context, a semantic prototype ontology for nouns is used, covering a large part of the lexicon. In a recent test run on Parliamentary debate transcripts, FrAG achieved F-scores of 98.7% for part of speech (PoS) and between 93.1% and 96.2% for syntactic function tags. Dependency links were correct in 95.9% of cases.
Degrees of Orality in Speech-like Corpora: Comparative Annotation of Chat and E-mail Corpora
Eckhard Bick
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation
2009
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)
Kristiina Jokinen | Eckhard Bick
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)
Automatic Semantic Role Annotation for Spanish
Eckhard Bick | M. Pilar Valverde Ibáñez
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)
DeepDict–A Graphical Corpus-based Dictionary of Word Relations
Eckhard Bick
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)
2007
Hybrid Ways to Improve Domain Independence in an ML Dependency Parser
Eckhard Bick
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
Using Danish as a CG Interlingua: A Wide-Coverage Norwegian-English Machine Translation System
Eckhard Bick | Lars Nygaard
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)
2006
Turning a Dependency Treebank into a PSG-style Constituent Treebank
Eckhard Bick
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
In this paper, we present and evaluate a new method to convert Constraint Grammar (CG) parses of running text into Constituent Treebanks. The conversion has two steps: first a grammar-based method is used to bridge the gap between raw CG annotation and full dependency structure, then phrase structure bracketing and non-terminal nodes are introduced by clustering sister dependents, effectively building one syntactic treebank on top of another. The method is compared with another approach (Bick 2003-2), where constituent structures are arrived at by employing a function-tag based Phrase Structure Grammar (PSG). Results are evaluated on a small reference corpus for both raw and revised CG input, with bracketing F-scores of 87.5% for raw text and 97.1% for revised CG input, and a raw text edge label accuracy of 95.9% for forms and 86% for functions, or 99.7% and 99.4%, respectively, for revised CG. By applying the tools to the CG-only part of the Danish Arboretum treebank we were able to increase the size of the treebank by 86%, from 197,400 to 367,500 words.
Semantic tagging for resolution of indirect anaphora
R. Vieira | E. Bick | J. Coelho | V. Muller | S. Collovini | J. Souza | L. Rino
Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue
LingPars, a Linguistically Inspired, Language-Independent Machine Learner for Dependency Treebanks
Eckhard Bick
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)
2004
La FREEBANK : vers une base libre de corpus annotés
Susanne Salmon-Alt | Eckhard Bick | Laurent Romary | Jean-Marie Pierrel
Actes de la 11ème conférence sur le Traitement Automatique des Langues Naturelles. Articles longs
Freely accessible French corpora annotated at linguistic levels other than the morphosyntactic one are insufficient, both quantitatively and qualitatively. Starting from this observation, FREEBANK, which is built with automatic analysis tools whose output is manually revised, aims to be a collection of French corpora annotated at several levels (structural, morphological, syntactic, coreferential) and at different degrees of linguistic granularity. It is intended to be freely accessible, encoded according to standardized schemes, integrated with existing resources, and open to gradual enrichment.
A Named Entity Recognizer for Danish
Eckhard Bick
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2002
Floresta Sintá(c)tica: A treebank for Portuguese
Susana Afonso | Eckhard Bick | Renato Haber | Diana Santos
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
2001
The VISL System: Research and applicative aspects of IT-based learning
Eckhard Bick
Proceedings of the 13th Nordic Conference of Computational Linguistics (NODALIDA 2001)
2000
Providing Internet Access to Portuguese Corpora: the AC/DC Project
Diana Santos | Eckhard Bick
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
1998
Structural Lexical Heuristics in the Automatic Analysis of Portuguese
Eckhard Bick
Proceedings of the 11th Nordic Conference of Computational Linguistics (NODALIDA 1998)