Introducing and evaluating ukWaC, a very large web-derived corpus of English

Adriano Ferraresi; Eros Zanchetta; Marco Baroni; Silvia Bernardini

Abstract

In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about availability and format of the corpus.

Anthology ID: 2008.wac-1.8
Volume: Proceedings of the 4th Web as Corpus Workshop
Month: June
Year: 2008
Address: Marrakech, Morocco
Editors: Stefan Evert, Adam Kilgarriff, Serge Sharoff
Venues: WAC | WS
Publisher: European Language Resources Association
Pages: 47–54
URL: https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.8/
Cite (ACL): Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop, pages 47–54, Marrakech, Morocco. European Language Resources Association.
Cite (Informal): Introducing and evaluating ukWaC, a very large web-derived corpus of English (Ferraresi et al., WAC 2008)
PDF: https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.8.pdf