Current Challenges in Web Corpus Building

Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý, Vit Suchomel


Abstract
In this paper we discuss some of the current challenges in web corpus building that we have faced in recent years while expanding the corpora in Sketch Engine. The purpose of the paper is to provide an overview and stimulate discussion of possible solutions, rather than to present ready-made ones. For every issue we try to assess its severity and briefly discuss possible mitigation options.
Anthology ID: 2020.wac-1.1
Volume: Proceedings of the 12th Web as Corpus Workshop
Month: May
Year: 2020
Address: Marseille, France
Venue: WAC
Publisher: European Language Resources Association
Pages: 1–4
Language: English
URL: https://aclanthology.org/2020.wac-1.1
Cite (ACL): Miloš Jakubíček, Vojtěch Kovář, Pavel Rychlý, and Vit Suchomel. 2020. Current Challenges in Web Corpus Building. In Proceedings of the 12th Web as Corpus Workshop, pages 1–4, Marseille, France. European Language Resources Association.
Cite (Informal): Current Challenges in Web Corpus Building (Jakubíček et al., WAC 2020)
PDF: https://preview.aclanthology.org/remove-xml-comments/2020.wac-1.1.pdf