Using Word Familiarities and Word Associations to Measure Corpus Representativeness

Reinhard Rapp


Abstract
The definition of corpus representativeness used here assumes that a representative corpus should reflect, as closely as possible, the average language use that a native speaker encounters in everyday life over an extended period. Because it is not practical to observe people’s language input over years, we suggest using two types of experimental data that capture two forms of human intuition: word familiarity norms and word association norms. If human language acquisition is indeed corpus-based, such data should reflect people’s perceived language input. On this assumption, we compute a representativeness score for a corpus by extracting word frequency and word association statistics from it and comparing these statistics to the human data. The higher the similarity, the more representative the corpus should be of the language environments of the test persons. We present results for five different corpora and for truncated versions thereof. The results confirm the expectation that corpus size and corpus balance are crucial to corpus representativeness.
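The familiarity half of the comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measure is used, so the Spearman rank correlation below is an assumption made for illustration, as are the function names (`average_ranks`, `pearson`, `familiarity_score`) and the toy data. It should not be read as the paper's actual procedure.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def familiarity_score(tokens, familiarity):
    """Spearman correlation between corpus frequency and familiarity ratings.

    ASSUMPTION: rank correlation stands in for the unspecified similarity
    measure; `familiarity` maps each rated word to a hypothetical rating.
    """
    counts = Counter(tokens)
    words = sorted(familiarity)
    freqs = [counts[w] for w in words]  # words absent from the corpus count 0
    ratings = [familiarity[w] for w in words]
    return pearson(average_ranks(freqs), average_ranks(ratings))


# Toy data, invented for illustration only.
corpus_tokens = "the the the dog dog zebra".split()
norms = {"the": 7.0, "dog": 6.0, "zebra": 3.0}
score = familiarity_score(corpus_tokens, norms)
```

Under this sketch, a score near 1 would mean that the corpus ranks words by frequency in much the same order in which people rank them by familiarity. An analogous comparison against word association norms would need co-occurrence statistics, which are not covered here.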
Anthology ID:
L14-1409
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2029–2036
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/492_Paper.pdf
Cite (ACL):
Reinhard Rapp. 2014. Using Word Familiarities and Word Associations to Measure Corpus Representativeness. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2029–2036, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Using Word Familiarities and Word Associations to Measure Corpus Representativeness (Rapp, LREC 2014)