Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Cristian Musat, Andreas Fischer


Abstract
This paper presents SwissCrawl, the largest Swiss German text corpus to date. Composed of more than half a million sentences, it was generated using a customized web scraping tool that could be applied to other low-resource languages as well. The approach demonstrates how freely available web pages can be used to construct comprehensive text corpora, which are of fundamental importance for natural language processing. In an experimental evaluation, we show that using the new corpus leads to significant improvements for the task of language modeling.
Anthology ID:
2020.lrec-1.329
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2706–2711
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.329
Cite (ACL):
Lucy Linder, Michael Jungo, Jean Hennebert, Claudiu Cristian Musat, and Andreas Fischer. 2020. Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 2706–2711, Marseille, France. European Language Resources Association.
Cite (Informal):
Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German (Linder et al., LREC 2020)
PDF:
https://preview.aclanthology.org/update-css-js/2020.lrec-1.329.pdf
Code:
derlin/swisstext-lrec