Abstract
Although corpus size is a well-known factor affecting performance on many NLP tasks, large freely available corpora are still scarce for many languages. In this paper we describe one effort to build a very large corpus for Brazilian Portuguese, the brWaC, generated following the Web as Corpus Kool Initiative. To indirectly assess the quality of the resulting corpus, we examined the impact of corpus origin on a specific task, the identification of Multiword Expressions with association measures, against a standard corpus. Focusing on nominal compounds, we find that the expressions obtained from each corpus are of comparable quality, which indicates that corpus origin has no impact on this task.
- Anthology ID:
- L14-1429
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 728–735
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/518_Paper.pdf
- Cite (ACL):
- Rodrigo Boos, Kassius Prestes, and Aline Villavicencio. 2014. Identification of Multiword Expressions in the brWaC. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 728–735, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- Identification of Multiword Expressions in the brWaC (Boos et al., LREC 2014)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/518_Paper.pdf