Abstract
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been actively researched. Prominently, the WaCky initiative has provided both theoretical results and a set of web corpora for selected European languages. We present a software toolkit for web corpus construction and a set of significantly larger corpora (up to more than 9 billion tokens) built using this software. First, we discuss how the data should be collected to ensure that it is not biased towards certain hosts. Then, we describe our software toolkit, which performs basic cleanup, boilerplate removal, simple connected-text detection, and shingling to remove duplicates from the corpora. Finally, we report evaluation results for the corpora built so far, for example with respect to the amount of duplication they contain and their text type/genre distribution. Where applicable, we compare our corpora to the WaCky corpora, since it is inappropriate, in our view, to compare web corpora to traditional or balanced corpora. While we use some methods applied by the WaCky initiative, we can show that we have introduced incremental improvements.
- Anthology ID:
- L12-1497
- Volume:
- Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
- Month:
- May
- Year:
- 2012
- Address:
- Istanbul, Turkey
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 486–493
- URL:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/834_Paper.pdf
- Cite (ACL):
- Roland Schäfer and Felix Bildhauer. 2012. Building Large Corpora from the Web Using a New Efficient Tool Chain. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 486–493, Istanbul, Turkey. European Language Resources Association (ELRA).
- Cite (Informal):
- Building Large Corpora from the Web Using a New Efficient Tool Chain (Schäfer & Bildhauer, LREC 2012)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2012/pdf/834_Paper.pdf