Collecting Basque specialized corpora from the web: language-specific performance tweaks and improving topic precision
I. Leturia, I. San Vicente, X. Saralegi, M. Lopez de Lacalle
Abstract
The de facto standard process for collecting corpora from the Internet (taking a given list of words, asking the APIs of search engines for random combinations of them, and downloading the returned pages) does not give very good precision when searching for texts on a certain topic. This precision is much worse when searching for corpora in the Basque language, due to certain properties inherent in the language and in the Basque web. The method proposed in this paper improves topic precision by using a sample mini-corpus as a basis for the process: the words to be used in the queries are automatically extracted from it, and a final topic-filtering step is performed using document-similarity measures with this sample corpus. We also describe the changes made to the usual process to adapt it to the peculiarities of Basque, alongside other adjustments to improve the general performance of the system and the quality of the collected corpora.
- Anthology ID:
- 2008.wac-1.7
- Volume:
- Proceedings of the 4th Web as Corpus Workshop
- Month:
- June
- Year:
- 2008
- Address:
- Marrakech, Morocco
- Editors:
- Stefan Evert, Adam Kilgarriff, Serge Sharoff
- Venues:
- WAC | WS
- Publisher:
- European Language Resources Association
- Pages:
- 40–46
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.7/
- Cite (ACL):
- I. Leturia, I. San Vicente, X. Saralegi, and M. Lopez de Lacalle. 2008. Collecting Basque specialized corpora from the web: language-specific performance tweaks and improving topic precision. In Proceedings of the 4th Web as Corpus Workshop, pages 40–46, Marrakech, Morocco. European Language Resources Association.
- Cite (Informal):
- Collecting Basque specialized corpora from the web: language-specific performance tweaks and improving topic precision (Leturia et al., WAC 2008)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2008.wac-1.7.pdf
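
The pipeline outlined in the abstract (seed words taken from a sample mini-corpus, random word combinations used as queries, and a final similarity-based topic filter) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of raw frequency to pick seed words, the bag-of-words cosine measure, and the threshold value are all hypothetical choices, and the search-engine call and the Basque-specific adjustments are left out entirely.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; an assumption, not the paper's preprocessing.
    return [w for w in text.lower().split() if w.isalpha()]


def seed_words(sample_docs, k=5):
    # Assumption: raw frequency stands in for whatever keyword-extraction
    # method the paper actually applies to the sample mini-corpus.
    counts = Counter(w for doc in sample_docs for w in tokenize(doc))
    return [w for w, _ in counts.most_common(k)]


def random_queries(seeds, n_queries=3, words_per_query=2, rng=None):
    # Random combinations of seed words, to be sent to a search engine API.
    rng = rng or random.Random(0)
    return [tuple(rng.sample(seeds, words_per_query)) for _ in range(n_queries)]


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def topic_filter(candidates, sample_docs, threshold=0.2):
    # Keep downloaded pages that are similar enough to the sample corpus.
    # The threshold is a hypothetical value chosen for this sketch.
    profile = Counter(w for doc in sample_docs for w in tokenize(doc))
    return [
        doc for doc in candidates
        if cosine(Counter(tokenize(doc)), profile) >= threshold
    ]
```

With a toy sample corpus about the sea, `topic_filter(["itsaso arrain sare", "futbol partida gol"], sample)` keeps only the first candidate, because the second shares no vocabulary with the sample.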