Elena Grishina
2010
Multimodal Russian Corpus (MURCO): First Steps
Elena Grishina
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The paper introduces the Multimodal Russian Corpus (MURCO), which has been created within the framework of the Russian National Corpus (RNC). The MURCO provides users with a large amount of phonetic, orthoepic, and intonational information about Russian. Moreover, the deeply annotated part of the MURCO contains data on Russian gesticulation, the speech act system, types of vocal gestures and interjections in Russian, and so on. The Corpus is freely accessible. The paper describes the main types of annotation and the interface structure of the MURCO. The MURCO consists of two parts, the second being a subset of the first: 1) the whole Corpus, which is annotated from the lexical (lemmatization), morphological, semantic, accentological, metatextual, and sociological points of view (these types of annotation are standard for the RNC), and also from the point of view of phonetics (the orthoepic annotation and the mark-up of accentological word structure); 2) the deeply annotated MURCO, which is additionally annotated for gesticulation and speech act structure.
Design and Data Collection for the Accentological Corpus of the Russian Language
Elena Grishina | Svetlana Savchuk | Alexej Poljakov
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The Accentological corpus gives researchers an opportunity to study word stress and stress variation, which are very important for the Russian language. Moreover, the Accentological corpus makes it possible to study the historical development of Russian stress. The paper presents the main characteristics of the Accentological corpus available at ruscorpora.ru. It describes the corpus size, the types and sources of text material and the way it is represented, the types of linguistic annotation, the corpus composition, and ways of using it effectively for different purposes. There are two zones in the Accentological corpus. 1) The zone of prose includes oral texts and film transcripts, in which stressed syllables are marked according to the real pronunciation. 2) The zone of poetry contains texts with marked accented syllables, so it is possible to determine the exact word stress using special rules. The Accentological corpus has four types of annotation (metatextual, morphological, semantic and sociological) as well as its own accentological mark-up. Thanks to the accentological annotation, each word is supplied with stress marks, so a user can make queries and retrieve stressed or unstressed word forms in combination with grammatical and semantic features.
2006
Spoken Russian in the Russian National Corpus (RNC)
Elena Grishina
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The RNC is now a 120-million-word collection of Russian texts, making it the most representative and authoritative corpus of the Russian language. It is available on the Internet at www.ruscorpora.ru. The RNC contains texts of all genres and types, covering Russian from the 19th to the 21st century. The practice of constructing national corpora has shown that it is indispensable to include sub-corpora of spoken language in the RNC. Therefore, the constructors of the RNC intend to include about 10 million words of Spoken Russian. Oral speech in the Corpus is represented in standard Russian orthography. Although this decision makes phonetic exploration of the Spoken Russian Corpus impossible, Spoken Russian can still be studied from any other linguistic point of view. In addition to the traditional annotations (metatextual and morphological), the Spoken Sub-corpus has sociological annotation. Unlike standard oral speech, which is spontaneous and is not intended to be reproduced, Multimedia Spoken Russian (MSR) is to a great extent premeditated and evidently meant to be reproduced. MSR is also to be included in the RNC: first of all, we plan to build a very interesting and provocative part of the RNC from the textual component of about 300 Russian films.