The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages
Shafqat Mumtaz Virk, Harald Hammarström, Markus Forsberg, Søren Wichmann
Abstract
There exist as many as 7000 natural languages in the world, and a huge number of documents describing those languages have been produced over the years. Most of those documents are in paper format. Any attempts to use modern computational techniques and tools to process those documents will require them to be digitized first. In this paper, we report a multilingual digitized version of thousands of such documents searchable through some well-established corpus infrastructures. The corpus is annotated with various meta, word, and text level attributes to make searching and analysis easier and more useful.
- Anthology ID:
- 2020.lrec-1.110
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 878–884
- Language:
- English
- URL:
- https://preview.aclanthology.org/Author-page-Marten-During-lu/2020.lrec-1.110/
- Cite (ACL):
- Shafqat Mumtaz Virk, Harald Hammarström, Markus Forsberg, and Søren Wichmann. 2020. The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 878–884, Marseille, France. European Language Resources Association.
- Cite (Informal):
- The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages (Virk et al., LREC 2020)
- PDF:
- https://preview.aclanthology.org/Author-page-Marten-During-lu/2020.lrec-1.110.pdf