Abstract
We present SETimes.HR, the first linguistically annotated corpus of Croatian that is freely available for all purposes. The corpus is built on top of the SETimes parallel corpus of nine Southeast European languages and English. It is manually annotated for lemmas, morphosyntactic tags, named entities and dependency syntax. We couple the corpus with domain-sensitive test sets for Croatian and Serbian to support direct model transfer evaluation between these closely related languages. We build and evaluate statistical models for lemmatization, morphosyntactic tagging, named entity recognition and dependency parsing on top of SETimes.HR and the test sets, providing the state of the art in all the tasks. We make all resources presented in the paper freely available under a very permissive licensing scheme.
- Anthology ID:
- L14-1542
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 1724–1727
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/690_Paper.pdf
- Cite (ACL):
- Željko Agić and Nikola Ljubešić. 2014. The SETimes.HR Linguistically Annotated Corpus of Croatian. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 1724–1727, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- The SETimes.HR Linguistically Annotated Corpus of Croatian (Agić & Ljubešić, LREC 2014)