Abstract
We acquire corpora from the domain of independent news, drawn from the Tlaxcala website. We build monolingual corpora for 15 languages and parallel corpora for all combinations of those 15 languages. The corpora cover languages for which very few such resources exist (e.g. Tamazight). We describe the acquisition process in detail and report statistics on the resulting corpora, chiefly their size per language (for the monolingual corpora) and per language pair (for the parallel corpora). To the best of our knowledge, these are the first publicly available parallel and monolingual corpora for the domain of independent news. We also create models for unsupervised sentence splitting for all the languages of the study.
- Anthology ID:
- L14-1093
- Volume:
- Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
- Month:
- May
- Year:
- 2014
- Address:
- Reykjavik, Iceland
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 3689–3692
- URL:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/1134_Paper.pdf
- Cite (ACL):
- Antonio Toral. 2014. TLAXCALA: a multilingual corpus of independent news. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 3689–3692, Reykjavik, Iceland. European Language Resources Association (ELRA).
- Cite (Informal):
- TLAXCALA: a multilingual corpus of independent news (Toral, LREC 2014)
- PDF:
- http://www.lrec-conf.org/proceedings/lrec2014/pdf/1134_Paper.pdf
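
The abstract states that parallel corpora are built for all combinations of the 15 languages. As a rough illustration of what that implies in scale, the sketch below enumerates unordered language pairs. The language identifiers are hypothetical placeholders, since the abstract does not list the 15 languages, and treating each pair as unordered (one corpus per pair, regardless of translation direction) is an assumption rather than something the abstract confirms.

```python
from itertools import combinations

# Hypothetical placeholder identifiers: the abstract names only one of the
# 15 languages (Tamazight), so real language codes are not assumed here.
languages = [f"lang{i:02d}" for i in range(1, 16)]

# Assumption: one parallel corpus per unordered pair of languages.
pairs = list(combinations(languages, 2))

print(len(languages))  # number of monolingual corpora
print(len(pairs))      # number of candidate parallel corpora: n * (n - 1) / 2
```

Under these assumptions, 15 languages give 105 candidate language pairs. Whether every pair actually yields a non-empty parallel corpus depends on the statistics reported in the paper itself.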