Google for the Linguist on a Budget

András Kornai, Péter Halácsy


Abstract
In this paper, we present GLB, yet another open source, free system to create and exploit linguistic corpora gathered from the web. A simple, robust web crawl algorithm, a multi-dimensional information retrieval tool, and a crude parallelization mechanism are proposed, especially for researchers working in resource-limited environments.
Anthology ID:
2008.wac-1.2
Volume:
Proceedings of the 4th Web as Corpus Workshop
Month:
June
Year:
2008
Address:
Marrakech, Morocco
Editors:
Stefan Evert, Adam Kilgarriff, Serge Sharoff
Venues:
WAC | WS
Publisher:
European Language Resources Association
Pages:
8–11
URL:
https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.2/
Cite (ACL):
András Kornai and Péter Halácsy. 2008. Google for the Linguist on a Budget. In Proceedings of the 4th Web as Corpus Workshop, pages 8–11, Marrakech, Morocco. European Language Resources Association.
Cite (Informal):
Google for the Linguist on a Budget (Kornai & Halácsy, WAC 2008)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.2.pdf