Victor: the Web-Page Cleaning Tool

Miroslav Spousta, Michal Marek, Pavel Pecina


Abstract
In this paper we present a complete solution for automatic cleaning of arbitrary HTML pages, with the goal of using web data as a corpus in natural language processing and computational linguistics. We employ a sequence-labeling approach based on Conditional Random Fields (CRF). Every block of text in the analyzed web page is assigned a set of features extracted from the textual content and the HTML structure of the page. The blocks are automatically labeled either as content segments, which contain the main web page content and should be preserved, or as noisy segments, which are not suitable for further linguistic processing and should be eliminated. Our solution is based on the tool introduced at the CLEANEVAL 2007 shared task workshop. In this paper, we present new CRF features, a handy annotation tool, and new evaluation metrics. The evaluation itself is performed on a random sample of web pages automatically downloaded from the Czech web domain.
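As a rough illustration of the block-level representation the abstract describes, the following minimal sketch splits a page into text blocks and attaches a few features to each. It is not the authors' implementation: the set of block-level tags, the class `BlockExtractor`, the function `extract_blocks`, and the features `n_tokens`, `link_density`, and `ends_with_punct` are all assumptions chosen for illustration, and the CRF labeling step itself is omitted.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; the paper's actual segmentation may differ.
BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3"}


class BlockExtractor(HTMLParser):
    """Collect text blocks with a few hypothetical features per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished blocks, in document order
        self._tag = None       # block-level tag that opened the current block
        self._text = []        # text pieces of the current block
        self._link_chars = 0   # characters seen inside <a> in the current block
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()      # a new block starts; close the previous one
            self._tag = tag
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()
            self._tag = None
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self._text.append(text)
        if self._in_link:
            self._link_chars += len(text)

    def _flush(self):
        text = " ".join(self._text)
        if text:
            self.blocks.append({
                "tag": self._tag or "body",                   # structural feature
                "text": text,
                "n_tokens": len(text.split()),                # textual feature
                "link_density": self._link_chars / len(text),
                "ends_with_punct": text[-1] in ".!?",
            })
        self._text = []
        self._link_chars = 0


def extract_blocks(html):
    """Return the list of feature dictionaries for the blocks of one page."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()            # emit any trailing text outside block tags
    return parser.blocks
```

In a sequence-labeling setup of the kind the abstract mentions, the list returned by `extract_blocks` would form one input sequence per page, and a trained CRF would assign each block a label such as content or noise. A navigation bar would typically show a high `link_density` and few tokens, while body text would show the opposite pattern.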
Anthology ID:
2008.wac-1.3
Volume:
Proceedings of the 4th Web as Corpus Workshop
Month:
June
Year:
2008
Address:
Marrakech, Morocco
Editors:
Stefan Evert, Adam Kilgarriff, Serge Sharoff
Venues:
WAC | WS
Publisher:
European Language Resources Association
Pages:
12–17
URL:
https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.3/
Cite (ACL):
Miroslav Spousta, Michal Marek, and Pavel Pecina. 2008. Victor: the Web-Page Cleaning Tool. In Proceedings of the 4th Web as Corpus Workshop, pages 12–17, Marrakech, Morocco. European Language Resources Association.
Cite (Informal):
Victor: the Web-Page Cleaning Tool (Spousta et al., WAC 2008)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.3.pdf