@inproceedings{spoustova-spousta-2012-high,
    title = "A High-Quality Web Corpus of {C}zech",
    author = "Spoustov{\'a}, Johanka  and
      Spousta, Miroslav",
    editor = "Calzolari, Nicoletta  and
      Choukri, Khalid  and
      Declerck, Thierry  and
      Do{\u{g}}an, Mehmet U{\u{g}}ur  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://preview.aclanthology.org/ingest-emnlp/L12-1008/",
    pages = "311--315",
    abstract = "In our paper, we present main results of the Czech grant project Internet as a Language Corpus, whose aim was to build a corpus of Czech web texts and to develop and publicly release related software tools. Our corpus may not be the largest web corpus of Czech, but it maintains very good language quality due to high portion of human work involved in the corpus development process. We describe the corpus contents (2.65 billions of words divided into three parts -- 450 millions of words from news and magazines articles, 1 billion of words from blogs, diaries and other non-reviewed literary units, 1.1 billion of words from discussions messages), particular steps of the corpus creation (crawling, HTML and boilerplate removal, near duplicates removal, language filtering) and its automatic language annotation (POS tagging, syntactic parsing). We also describe our software tools being released under an open source license, especially a fast linear-time module for removing near-duplicates on a paragraph level."
}Markdown (Informal)
[A High-Quality Web Corpus of Czech](https://preview.aclanthology.org/ingest-emnlp/L12-1008/) (Spoustová & Spousta, LREC 2012)
ACL
- Johanka Spoustová and Miroslav Spousta. 2012. A High-Quality Web Corpus of Czech. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 311–315, Istanbul, Turkey. European Language Resources Association (ELRA).