Bruno Laranjeira
2014
Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them
Bruno Laranjeira | Viviane Moreira | Aline Villavicencio | Carlos Ramisch | Maria José Finatto
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Comparable corpora have been used as an alternative to parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms by using them to collect comparable corpora on a specific domain. We then compare the evaluation of the focused crawling algorithms to the performance of linguistic processes executed after training with the corresponding generated corpora. We also propose a novel approach to focused crawling that exploits the expressive power of multiword expressions.