Tagging a Corpus of Interpreted Speeches: the European Parliament Interpreting Corpus (EPIC)

Annalisa Sandrelli, Claudio Bendazzoli


Abstract
The performance of three different taggers (TreeTagger, FreeLing and GRAMPAL) is evaluated on three different languages, i.e. English, Italian and Spanish. The materials are transcripts from the European Parliament Interpreting Corpus (EPIC), a corpus of original (source) and simultaneously interpreted (target) speeches. Owing to the oral nature of our materials and to the specific characteristics of spoken language produced in simultaneous interpreting, the chosen taggers have to deal with non-standard word order, disfluencies and other features not to be found in written language. Parts of the tagged sub-corpora were automatically extracted in order to assess the success rate achieved in tagging and lemmatisation. Errors and problems are discussed for each tagger, and conclusions are drawn regarding future developments.
Anthology ID:
L06-1093
Volume:
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Month:
May
Year:
2006
Address:
Genoa, Italy
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
URL:
http://www.lrec-conf.org/proceedings/lrec2006/pdf/174_pdf.pdf
Cite (ACL):
Annalisa Sandrelli and Claudio Bendazzoli. 2006. Tagging a Corpus of Interpreted Speeches: the European Parliament Interpreting Corpus (EPIC). In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06), Genoa, Italy. European Language Resources Association (ELRA).
Cite (Informal):
Tagging a Corpus of Interpreted Speeches: the European Parliament Interpreting Corpus (EPIC) (Sandrelli & Bendazzoli, LREC 2006)