TALC-sef A Manually-Revised POS-TAgged Literary Corpus in Serbian, English and French

Antonio Balvet, Dejan Stosic, Aleksandra Miletic


Abstract
In this paper, we present the TALC-sef corpus, a parallel literary corpus for Serbian, English and French. The corpus includes a manually revised, POS-tagged reference Serbian corpus of over 150,000 words. The initial objective was to devise a reference parallel corpus in the three languages, for both literary and linguistic studies. The French and English sub-corpora had been POS-tagged from the outset, using TreeTagger (Schmid, 1994), but until now the corpus lacked a tagged version of the Serbian sub-corpus. Here, we first present the original parallel literary corpus, then address issues related to POS-tagging a large collection of Serbian text: from the design of an appropriate tagset for Serbian, to the choice of an automatic POS-tagger suited to the task, and then to some quantitative and qualitative results. We conclude with a discussion of near-term perspectives for further annotation of the whole parallel corpus.
Anthology ID:
L14-1591
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
4105–4110
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/755_Paper.pdf
Cite (ACL):
Antonio Balvet, Dejan Stosic, and Aleksandra Miletic. 2014. TALC-sef A Manually-Revised POS-TAgged Literary Corpus in Serbian, English and French. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 4105–4110, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
TALC-sef A Manually-Revised POS-TAgged Literary Corpus in Serbian, English and French (Balvet et al., LREC 2014)