I. San Vicente

2008

Collecting Basque specialized corpora from the web: language-specific performance tweaks, improving topic precision
I. Leturia | I. San Vicente | X. Saralegi | M. Lopez de Lacalle
Proceedings of the 4th Web as Corpus Workshop

The de facto standard process for collecting corpora from the Internet (taking a given list of words, asking the APIs of search engines for random combinations of them, and downloading the returned pages) does not give very good precision when searching for texts on a certain topic. This precision is much worse when searching for corpora in the Basque language, due to certain properties inherent in the language and in the Basque web. The method proposed in this paper improves topic precision by using a sample mini-corpus as a basis for the process: the words to be used in the queries are automatically extracted from it, and a final topic-filtering step is performed using document-similarity measures with this sample corpus. We also describe the changes made to the usual process to adapt it to the peculiarities of Basque, alongside other adjustments to improve the general performance of the system and the quality of the collected corpora.
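The pipeline outlined in the abstract (seed words taken from a sample mini-corpus, random word combinations used as queries, and a final similarity-based topic filter) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the use of raw term frequency to pick seed words, bag-of-words cosine similarity as the document-similarity measure, and the threshold value are not taken from the paper, and the search-engine and download steps are left out entirely.

```python
import math
import random
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase word tokenizer (a stand-in; no Basque-specific processing)."""
    return re.findall(r"\w+", text.lower())


def seed_words(sample_docs, stopwords, k):
    """Pick the k most frequent non-stopword terms of the sample mini-corpus.
    Raw frequency is an assumption, not the paper's extraction method."""
    counts = Counter(
        w for doc in sample_docs for w in tokenize(doc) if w not in stopwords
    )
    return [w for w, _ in counts.most_common(k)]


def random_queries(seeds, words_per_query, n_queries, rng):
    """Build query strings from random combinations of the seed words."""
    combos = list(combinations(seeds, words_per_query))
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    return [" ".join(c) for c in chosen]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def topic_filter(candidates, sample_docs, threshold):
    """Keep candidate documents whose similarity to the pooled sample
    corpus reaches the (hypothetical) threshold."""
    profile = Counter(w for doc in sample_docs for w in tokenize(doc))
    return [
        d for d in candidates
        if cosine(Counter(tokenize(d)), profile) >= threshold
    ]


# Toy data, invented for this example only.
sample = ["itsasoa arrantza itsasoa ontzia", "arrantza itsasoa portua"]
seeds = seed_words(sample, stopwords=set(), k=2)
queries = random_queries(seeds, words_per_query=2, n_queries=1, rng=random.Random(0))
kept = topic_filter(["itsasoa arrantza", "futbola partida"], sample, threshold=0.1)
```

In a real collector, each query string would be sent to a search API and the returned pages downloaded before filtering; those steps depend on the service used and are not sketched here.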
