Many digital humanists (philologists, historians, sociologists, librarians, the audience for web archives) design their research around metadata (publication date ranges, sources, authors, etc.). However, current major web archives are limited to technical metadata while lacking high quality, descriptive metadata allowing for faceted queries. As researchers often lack the technical skill necessary to enrich existing web archives with descriptive metadata, they increasingly turn to creating personal web archives that contain such metadata, tailored to their research requirements. Software that enable creating such archives without advanced technical skills have gained popularity, however, tools for examination and querying are currently the missing link. We showcase a solution designed to fill this gap.
ELTE Poetry Corpus is a database that stores canonical Hungarian poetry with automatically generated annotations of the poems’ structural units, grammatical features and sound devices, i.e. rhyme patterns, rhyme pairs, rhythm, alliterations and the main phonological features of words. The corpus has an open access online query tool with several search functions. The paper presents the main stages of the annotation process and the tools used for each stage. The TEI XML format of the different versions of the corpus, each of which contains an increasing number of annotation layers, is presented as well. We have also specified our own XML format for the corpus, slightly different from TEI, in order to make it easier and faster to execute queries on the corpus. We discuss the results of a manual evaluation of the quality of automatic annotation of rhythm, as well as the results of an automatic evaluation of different rule sets used for the automatic annotation of rhyme patterns. Finally, the paper gives an overview of the main functions of the online query tool developed for the corpus.
In this article, we present the method we used to create a middle-sized corpus using targeted web crawling. Our corpus contains news portal articles along with their metadata, that can be useful for diverse audiences, ranging from digital humanists to NLP users. The method presented in this paper applies rule-based components that allow the curation of the text and the metadata content. The curated data can thereon serve as a reference for various tasks and measurements. We designed our workflow to encourage modification and customisation. Our concept can also be applied to other genres of portals by using the discovered patterns in the architecture of the portals. We found that for a systematic creation or extension of a similar corpus, our method provides superior accuracy and ease of use compared to The Wayback Machine, while requiring minimal manpower and computational resources. Reproducing the corpus is possible if changes are introduced to the text-extraction process. The standard TEI format and Schema.org encoded metadata is used for the output format, but we stress that placing the corpus in a digital repository system is recommended in order to be able to define semantic relations between the segments and to add rich annotation.