I. San Vicente




2008

Collecting Basque specialized corpora from the web: language-specific performance tweaks, improving topic precision
I. Leturia | I. San Vicente | X. Saralegi | M. Lopez de Lacalle
Proceedings of the 4th Web as Corpus Workshop

The de facto standard process for collecting corpora from the Internet (taking a given list of words, asking the APIs of search engines for random combinations of them, and downloading the returned pages) does not give very good precision when searching for texts on a certain topic. This precision is much worse when searching for corpora in the Basque language, due to certain properties inherent in the language and in the Basque web. The method proposed in this paper improves topic precision by using a sample mini-corpus as a basis for the process: the words to be used in the queries are automatically extracted from it, and a final topic-filtering step is performed using document-similarity measures with this sample corpus. We also describe the changes made to the usual process to adapt it to the peculiarities of Basque, alongside other adjustments to improve the general performance of the system and the quality of the collected corpora.