Adam Ussishkin
2014
ACTIV-ES: a comparable, cross-dialect corpus of ‘everyday’ Spanish from Argentina, Mexico, and Spain
Jerid Francom | Mans Hulden | Adam Ussishkin
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Corpus resources for Spanish have proved invaluable for a number of applications in a wide variety of fields. However, a majority of resources are based on formal, written language and/or are not built to model language variation between varieties of the Spanish language, despite the fact that most language in everyday use is informal, dialogue-based, and shows rich regional variation. This paper outlines the development and evaluation of the ACTIV-ES corpus, a first step toward producing a comparable, cross-dialect corpus representative of the everyday language of various regions of the Spanish-speaking world.
2010
How Specialized are Specialized Corpora? Behavioral Evaluation of Corpus Representativeness for Maltese.
Jerid Francom | Amy LaCross | Adam Ussishkin
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
In this paper we bring to light a novel intersection between corpus linguistics and behavioral data that can be employed as an evaluation metric for resources for low-density languages, drawing on well-established psycholinguistic factors. Using the low-density language Maltese as a test case, we highlight the challenges that face researchers developing resources for languages with sparsely available data and identify a key empirical link between corpus and psycholinguistic research as a tool to evaluate corpus resources. Specifically, we compare two robust variables identified in the psycholinguistic literature: word frequency (as measured in a corpus) and word familiarity (as measured in a rating task). We then apply statistical methods to evaluate the extent to which familiarity ratings predict corpus frequency for verbs in the Maltese corpus from three angles: 1) token frequency, 2) frequency distributions, and 3) morpho-syntactic type (binyan). This research provides a multidisciplinary approach to corpus development and evaluation, in particular for less-resourced languages that lack wide access to diverse language data.