Introducing and evaluating ukWaC, a very large web-derived corpus of English
Adriano Ferraresi, Eros Zanchetta, Marco Baroni, Silvia Bernardini
Abstract
In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about availability and format of the corpus.
- Anthology ID:
- 2008.wac-1.8
- Volume:
- Proceedings of the 4th Web as Corpus Workshop
- Month:
- June
- Year:
- 2008
- Address:
- Marrakech, Morocco
- Editors:
- Stefan Evert, Adam Kilgarriff, Serge Sharoff
- Venues:
- WAC | WS
- Publisher:
- European Language Resources Association
- Pages:
- 47–54
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.8/
- Cite (ACL):
- Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop, pages 47–54, Marrakech, Morocco. European Language Resources Association.
- Cite (Informal):
- Introducing and evaluating ukWaC, a very large web-derived corpus of English (Ferraresi et al., WAC 2008)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.8.pdf