Beyond SoNaR: towards the facilitation of large corpus building efforts

Martin Reynaert, Ineke Schuurman, Véronique Hoste, Nelleke Oostdijk, Maarten van Gompel


Abstract
In this paper we report on the experiences gained in the recent construction of the SoNaR corpus, a 500 MW reference corpus of contemporary, written Dutch. It shows what can realistically be done within the confines of a project setting where there are limitations to the duration in time as well as to the budget, employing current state-of-the-art tools, standards and best practices. By doing so we aim to pass on insights that may be beneficial for anyone considering undertaking an effort towards building a large, varied yet balanced corpus for use by the wider research community. Various issues are discussed that come into play while compiling a large corpus, including approaches to acquiring texts, the arrangement of IPR, the choice of text formats, and steps to be taken in the preprocessing of data from widely different origins. We describe FoLiA, a new XML format geared at rich linguistic annotations. We also explain the rationale behind the investment in the high-quality semi-automatic enrichment of a relatively small (1 MW) subset with very rich syntactic and semantic annotations. Finally, we present some ideas about future developments and the direction corpus development may take, such as setting up an integrated workflow between web services and the potential role for ISOcat. We list tips for potential corpus builders, tricks they may want to try and further recommendations regarding technical developments future corpus builders may wish to hope for.
Anthology ID:
L12-1437
Volume:
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Month:
May
Year:
2012
Address:
Istanbul, Turkey
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2897–2904
URL:
http://www.lrec-conf.org/proceedings/lrec2012/pdf/748_Paper.pdf
Cite (ACL):
Martin Reynaert, Ineke Schuurman, Véronique Hoste, Nelleke Oostdijk, and Maarten van Gompel. 2012. Beyond SoNaR: towards the facilitation of large corpus building efforts. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 2897–2904, Istanbul, Turkey. European Language Resources Association (ELRA).
Cite (Informal):
Beyond SoNaR: towards the facilitation of large corpus building efforts (Reynaert et al., LREC 2012)