@inproceedings{schafer-2016-commoncow,
    title = "{C}ommon{COW}: Massively Huge Web Corpora from {C}ommon{C}rawl Data and a Method to Distribute them Freely under Restrictive {EU} Copyright Laws",
    author = {Sch{\"a}fer, Roland},
    editor = "Calzolari, Nicoletta  and
      Choukri, Khalid  and
      Declerck, Thierry  and
      Goggi, Sara  and
      Grobelnik, Marko  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Mazo, Helene  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
    month = may,
    year = "2016",
    address = "Portoro{\v{z}}, Slovenia",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://preview.aclanthology.org/landing_page/L16-1712/",
    pages = "4500--4504",
    abstract = "In this paper, I describe a method of creating massively huge web corpora from the CommonCrawl data sets and redistributing the resulting annotations in a stand-off format. Current EU (and especially German) copyright legislation categorically forbids the redistribution of downloaded material without express prior permission by the authors. Therefore, such stand-off annotations (or other derivates) are the only format in which European researchers (like myself) are allowed to re-distribute the respective corpora. In order to make the full corpora available to the public despite such restrictions, the stand-off format presented here allows anybody to locally reconstruct the full corpora with the least possible computational effort."
}Markdown (Informal)
[CommonCOW: Massively Huge Web Corpora from CommonCrawl Data and a Method to Distribute them Freely under Restrictive EU Copyright Laws](https://preview.aclanthology.org/landing_page/L16-1712/) (Schäfer, LREC 2016)
ACL