Introducing and evaluating ukWaC, a very large web-derived corpus of English

Adriano Ferraresi, Eros Zanchetta, Marco Baroni, Silvia Bernardini


Abstract
In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about the availability and format of the corpus.
Anthology ID:
2008.wac-1.8
Volume:
Proceedings of the 4th Web as Corpus Workshop
Month:
June
Year:
2008
Address:
Marrakech, Morocco
Editors:
Stefan Evert, Adam Kilgarriff, Serge Sharoff
Venues:
WAC | WS
Publisher:
European Language Resources Association
Pages:
47–54
URL:
https://preview.aclanthology.org/jlcl-multiple-ingestion/2008.wac-1.8/
Cite (ACL):
Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop, pages 47–54, Marrakech, Morocco. European Language Resources Association.
Cite (Informal):
Introducing and evaluating ukWaC, a very large web-derived corpus of English (Ferraresi et al., WAC 2008)
PDF:
https://preview.aclanthology.org/jlcl-multiple-ingestion/2008.wac-1.8.pdf