2014
ACTIV-ES: a comparable, cross-dialect corpus of ‘everyday’ Spanish from Argentina, Mexico, and Spain
Jerid Francom | Mans Hulden | Adam Ussishkin
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Corpus resources for Spanish have proved invaluable for a number of applications in a wide variety of fields. However, a majority of resources are based on formal, written language and/or are not built to model language variation between varieties of the Spanish language, despite the fact that most language in everyday use is informal/dialogue-based and shows rich regional variation. This paper outlines the development and evaluation of the ACTIV-ES corpus, a first step toward producing a comparable, cross-dialect corpus representative of the everyday language of various regions of the Spanish-speaking world.
2013
Finite State Applications with Javascript
Mans Hulden | Miikka Silfverberg | Jerid Francom
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)
2012
Boosting statistical tagger accuracy with simple rule-based grammars
Mans Hulden | Jerid Francom
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
We report on several experiments on combining a rule-based tagger and a trigram tagger for Spanish. The results show that one can boost the accuracy of the best performing n-gram taggers by quickly developing a rough rule-based grammar to complement the statistically induced one and then combining the output of the two. The specific method of combination is crucial for achieving good results. The method provides particularly large gains in accuracy when only a small amount of tagged data is available for training an HMM, as may be the case for lesser-resourced and minority languages.
2010
How Specialized are Specialized Corpora? Behavioral Evaluation of Corpus Representativeness for Maltese.
Jerid Francom | Amy LaCross | Adam Ussishkin
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then apply statistical methods to evaluate the extent to which familiarity ratings predict corpus frequency for verbs in the Maltese corpus from three angles: 1) token frequency, 2) frequency distributions, and 3) morpho-syntactic type (binyan). This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack wide access to diverse language data.
2008
Parallel Multi-Theory Annotations of Syntactic Structure
Jerid Francom | Mans Hulden
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
We present an approach to creating a treebank of sentences using multiple notations or linguistic theories simultaneously. We illustrate the method by annotating sentences from the Penn Treebank II in three different theories in parallel: the original PTB notation, a Functional Dependency Grammar notation, and a Government and Binding style notation. Sentences annotated with all of these theories are represented in XML as a directed acyclic graph where nodes and edges may carry extra information depending on the theory encoded.