Peteris Paikens

Also published as: Pēteris Paikens

2020

pdf bib abs
Development and Evaluation of Speech Synthesis Corpora for Latvian
Roberts Darģis | Peteris Paikens | Normunds Gruzitis | Ilze Auzina | Agate Akmane
Proceedings of the 12th Language Resources and Evaluation Conference

Text to speech (TTS) systems are necessary for all languages to ensure accessibility and availability of digital language services. Recent advances in neural speech synthesis have eText to speech (TTS) systems are necessary for any language to ensure accessibility and availability of digital language services. Recent advances in neural speech synthesis have enabled the development of such systems with a data-driven approach that does not require significant development of language-specific tools. However, smaller languages often lack speech corpora that would be sufficient for training current neural TTS models, which require at least 30 hours of good quality audio recordings from a single speaker in a noiseless environment with matching transcriptions. Making such a corpus manually can be cost prohibitive. This paper presents an unsupervised approach to obtain a suitable corpus from unannotated recordings using automated speech recognition for transcription, as well as automated speaker segmentation and identification. The proposed method and software tools are applied and evaluated on a case study for developing a corpus suitable for Latvian speech synthesis based on Latvian public radio archive data.nabled the development of such systems with a data-driven approach that does not require much language-specific tool development. However, smaller languages often lack speech corpora that would be sufficient for training current neural TTS models, which require approximately 30 hours of good quality audio recordings from a single speaker in a noiseless environment with matching transcriptions. Making such a corpus manually can be cost prohibitive. This paper presents an unsupervised approach to obtain a suitable corpus from unannotated recordings using automated speech recognition for transcription, as well as automated speaker segmentation and identification. The proposed methods and software tools are applied and evaluated on a case study for developing a corpus suitable for Latvian speech synthesis based on Latvian public radio archive data.

2018

pdf bib
Creation of a Balanced State-of-the-Art Multilayer Corpus for NLU
Normunds Gruzitis | Lauma Pretkalnina | Baiba Saulite | Laura Rituma | Gunta Nespore-Berzkalne | Arturs Znotins | Peteris Paikens
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

We describe an extensive and versatile lexical resource for Latvian, an under-resourced Indo-European language, which we call Tezaurs (Latvian for ‘thesaurus’). It comprises a large explanatory dictionary of more than 250,000 entries that are derived from more than 280 external sources. The dictionary is enriched with phonetic, morphological, semantic and other annotations, as well as augmented by various language processing tools allowing for the generation of inflectional forms and pronunciation, for on-the-fly selection of corpus examples, for suggesting synonyms, etc. Tezaurs is available as a public and widely used web application for end-users, as an open data set for the use in language technology (LT), and as an API ― a set of web services for the integration into third-party applications. The ultimate goal of Tezaurs is to be the central computational lexicon for Latvian, bringing together all Latvian words and frequently used multi-word units and allowing for the integration of other LT resources and tools.

2015

pdf bib
Riga: from FrameNet to Semantic Frames with C6.0 Rules
Guntis Barzdins | Peteris Paikens | Didzis Gosko
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

2014

pdf bib abs
Using C5.0 and Exhaustive Search for Boosting Frame-Semantic Parsing Accuracy
Guntis Barzdins | Didzis Gosko | Laura Rituma | Peteris Paikens
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Frame-semantic parsing is a kind of automatic semantic role labeling performed according to the FrameNet paradigm. The paper reports a novel approach for boosting frame-semantic parsing accuracy through the use of the C5.0 decision tree classifier, a commercial version of the popular C4.5 decision tree classifier, and manual rule enhancement. Additionally, the possibility to replace C5.0 by an exhaustive search based algorithm (nicknamed C6.0) is described, leading to even higher frame-semantic parsing accuracy at the expense of slightly increased training time. The described approach is particularly efficient for languages with small FrameNet annotated corpora as it is for Latvian, which is used for illustration. Frame-semantic parsing accuracy achieved for Latvian through the C6.0 algorithm is on par with the state-of-the-art English frame-semantic parsers. The paper includes also a frame-semantic parsing use-case for extracting structured information from unstructured newswire texts, sometimes referred to as bridging of the semantic gap.

pdf bib abs
Coreference Resolution for Latvian
Artūrs Znotiņš | Pēteris Paikens
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Coreference resolution (CR) is a current problem in natural language processing (NLP) research and it is a key task in applications such as question answering, text summarization and information extraction for which text understanding is of crucial importance. We describe an implementation of coreference resolution tools for Latvian language, developed as a part of a tool chain for newswire text analysis but usable also as a separate, publicly available module. LVCoref is a rule based CR system that uses entity centric model that encourages the sharing of information across all mentions that point to the same real-world entity. The system is developed to provide starting ground for further experiments and generate a reference baseline to be compared with more advanced rule-based and machine learning based future coreference resolvers. It now reaches 66.6 F-score using predicted mentions and 78.1% F-score using gold mentions. This paper describes current efforts to create a CR system and to improve NER performance for Latvian. Task also includes creation of the corpus of manually annotated coreference relations.

2013

pdf bib
Morphological Analysis with Limited Resources: Latvian Example
Pēteris Paikens | Laura Rituma | Lauma Pretkalniņa
Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013)

2012

pdf bib abs
An implementation of a Latvian resource grammar in Grammatical Framework
Pēteris Paikens | Normunds Grūzītis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper describes an open-source Latvian resource grammar implemented in Grammatical Framework (GF), a programming language for multilingual grammar applications. GF differentiates between concrete grammars and abstract grammars: translation among concrete languages is provided via abstract syntax trees. Thus the same concrete grammar is effectively used for both language analysis and language generation. Furthermore, GF differentiates between general-purpose resource grammars and domain-specific application grammars that are built on top of the resource grammars. The GF resource grammar library (RGL) currently supports more than 20 languages that implement a common API. Latvian is the 13th official European Union language that is made available in the RGL. We briefly describe the grammatical features of Latvian and illustrate how they are handled in the multilingual framework of GF. We also illustrate some application areas of the Latvian resource grammar, and briefly discuss the limitations of the RGL and potential long-term improvements using frame semantics.

Co-authors