This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we don't generate MODS or Endnote formats, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
Drahomíra “johanka”Spoustová
Also published as:
Drahomíra „johanka“ Spoustová,
Johanka Spoustová
Fixing paper assignments
Please select all papers that do not belong to this person.
Indicate below which author they should be assigned to.
In our paper, we present main results of the Czech grant project Internet as a Language Corpus, whose aim was to build a corpus of Czech web texts and to develop and publicly release related software tools. Our corpus may not be the largest web corpus of Czech, but it maintains very good language quality due to high portion of human work involved in the corpus development process. We describe the corpus contents (2.65 billions of words divided into three parts -- 450 millions of words from news and magazines articles, 1 billion of words from blogs, diaries and other non-reviewed literary units, 1.1 billion of words from discussions messages), particular steps of the corpus creation (crawling, HTML and boilerplate removal, near duplicates removal, language filtering) and its automatic language annotation (POS tagging, syntactic parsing). We also describe our software tools being released under an open source license, especially a fast linear-time module for removing near-duplicates on a paragraph level.
Large corpora are essential to modern methods of computational linguistics and natural language processing. In this paper, we describe an ongoing project whose aim is to build a largest corpus of Czech texts. We are building the corpus from Czech Internet web pages, using (and, if needed, developing) advanced downloading, cleaning and automatic linguistic processing tools. Our concern is to keep the whole process language independent and thus applicable also for building web corpora of other languages. In the paper, we briefly describe the crawling, cleaning, and part-of-speech tagging procedures. Using a prototype corpus, we provide a comparison with a current corpora (in particular, SYN2005, part of the Czech National Corpora). We analyse part-of-speech tag distribution, OOV word ratio, average sentence length and Spearman rank correlation coefficient of the distance of ranks of 500 most frequent words. Our results show that our prototype corpus is now quite homogenous. The challenging task is to find a way to decrease the homogeneity of the text while keeping the high quality of the data.
In our paper we present a methodology used for low-cost validation of quality of Part-of-Speech annotation of the Prague Dependency Treebank based on multiple re-annotation of data samples carefully selected with the help of several different Part-of-Speech taggers.