Felix Bildhauer
2020
Proceedings of the 12th Web as Corpus Workshop
Adrien Barbaresi | Felix Bildhauer | Roland Schäfer | Egon Stemle
Proceedings of the 12th Web as Corpus Workshop
2017
Data point selection for genre-aware parsing
Ines Rehbein | Felix Bildhauer
Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories
2016
Automatic Classification by Topic Domain for Meta Data Generation, Web Corpus Evaluation, and Corpus Comparison
Roland Schäfer | Felix Bildhauer
Proceedings of the 10th Web as Corpus Workshop
2014
Proceedings of the 9th Web as Corpus Workshop (WaC-9)
Felix Bildhauer | Roland Schäfer
Proceedings of the 9th Web as Corpus Workshop (WaC-9)
Focused Web Corpus Crawling
Roland Schäfer | Adrien Barbaresi | Felix Bildhauer
Proceedings of the 9th Web as Corpus Workshop (WaC-9)
2012
Building Large Corpora from the Web Using a New Efficient Tool Chain
Roland Schäfer | Felix Bildhauer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Over the last decade, methods of web corpus construction and the evaluation of web corpora have been actively researched. Prominently, the WaCky initiative has provided both theoretical results and a set of web corpora for selected European languages. We present a software toolkit for web corpus construction and a set of significantly larger corpora (the largest exceeding 9 billion tokens) built using this software. First, we discuss how the data should be collected to ensure that it is not biased towards certain hosts. Then, we describe our software toolkit, which performs basic cleanups, boilerplate removal, simple connected text detection, and shingling to remove duplicates from the corpora. We finally report evaluation results for the corpora built so far, for example with respect to the amount of duplication contained and the text type/genre distribution. Where applicable, we compare our corpora to the WaCky corpora, since it is inappropriate, in our view, to compare web corpora to traditional or balanced corpora. While we use some methods applied by the WaCky initiative, we show that we have introduced incremental improvements.
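The shingling step mentioned in the abstract can be illustrated with a minimal sketch of w-shingling for near-duplicate detection. The function names, the window size, and the use of MD5 hashes here are illustrative assumptions, not details of the authors' toolkit:

```python
import hashlib

def shingles(tokens, w=5):
    """Return the set of hashed w-token shingles for a tokenized document."""
    return {
        hashlib.md5(" ".join(tokens[i:i + w]).encode("utf-8")).hexdigest()
        for i in range(max(1, len(tokens) - w + 1))
    }

def resemblance(a, b):
    """Jaccard similarity of two shingle sets (Broder's resemblance measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Near-duplicate documents share many shingles; unrelated ones share few.
doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumps over a lazy dog".split()
doc3 = "completely different text about web corpora".split()

s1, s2, s3 = shingles(doc1), shingles(doc2), shingles(doc3)
print(resemblance(s1, s2))  # 0.25: two of eight distinct shingles shared
print(resemblance(s1, s3))  # 0.0: no shared shingles
```

In a real pipeline, documents whose resemblance exceeds a chosen threshold would be treated as duplicates and one copy removed; production systems typically approximate the shingle-set comparison with minhashing rather than computing full set intersections.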