The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages

Shafqat Mumtaz Virk, Harald Hammarström, Markus Forsberg, Søren Wichmann


Abstract
As many as 7000 natural languages exist in the world, and a huge number of documents describing those languages have been produced over the years. Most of those documents are in paper format. Any attempt to process them with modern computational techniques and tools requires that they first be digitized. In this paper, we report on a multilingual digitized version of thousands of such documents, searchable through some well-established corpus infrastructures. The corpus is annotated with various meta-, word-, and text-level attributes to make searching and analysis easier and more useful.
Anthology ID:
2020.lrec-1.110
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
878–884
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.110
Cite (ACL):
Shafqat Mumtaz Virk, Harald Hammarström, Markus Forsberg, and Søren Wichmann. 2020. The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 878–884, Marseille, France. European Language Resources Association.
Cite (Informal):
The DReaM Corpus: A Multilingual Annotated Corpus of Grammars for the World’s Languages (Virk et al., LREC 2020)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2020.lrec-1.110.pdf