Tiit Hennoste

2010

Information Retrieval of Word Form Variants in Spoken Language Corpora Using Generalized Edit Distance
Siim Orasmaa | Reina Käärik | Jaak Vilo | Tiit Hennoste
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

An important feature of spoken language corpora is the existence of different spelling variants of words in transcription. This poses a problem for linguists who work with large spoken corpora: how to find all variants of a word without annotating them manually? Our work describes a search engine that finds different spelling variants (true positives) in a corpus of spoken language while efficiently reducing the number of false positives returned during the search. Our search engine uses a generalized variant of the edit distance algorithm that allows defining text-specific string-to-string transformations in addition to the default edit operations defined in edit distance. We have extended our algorithm with the capability to block transformations in specific substrings of search words: the user can mark certain regions (blocked regions) of the search word where edit operations are not allowed. Our material comes from the Corpus of Spoken Estonian of the University of Tartu, which consists of about 2000 dialogues and texts, about 1.4 million running text units in total.
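The idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the weight values, the representation of blocked regions as a set of character positions, and the rule for insertions inside a blocked region are all assumptions made for illustration, and the example strings are hypothetical rather than taken from the corpus.

```python
from math import inf


def generalized_edit_distance(query, candidate, transformations=(), blocked=frozenset()):
    """Cost of turning `query` into `candidate` (hypothetical sketch).

    transformations: iterable of (src, dst, cost) string-to-string rules
                     applied in addition to unit-cost insert/delete/substitute.
    blocked:         set of 0-based positions in `query` that must not be edited.
    Returns math.inf when no allowed edit sequence exists.
    """
    n, m = len(query), len(candidate)
    # dist[i][j] = cheapest way to turn query[:i] into candidate[:j]
    dist = [[inf] * (m + 1) for _ in range(n + 1)]
    dist[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            cur = dist[i][j]
            if cur == inf:
                continue
            if i < n and j < m:
                if query[i] == candidate[j]:
                    # Exact match is always free, even inside a blocked region.
                    dist[i + 1][j + 1] = min(dist[i + 1][j + 1], cur)
                elif i not in blocked:
                    # Substitution of an unblocked character.
                    dist[i + 1][j + 1] = min(dist[i + 1][j + 1], cur + 1)
            if i < n and i not in blocked:
                # Deletion of an unblocked character.
                dist[i + 1][j] = min(dist[i + 1][j], cur + 1)
            # Assumption: an insertion is refused only when it would fall
            # between two adjacent blocked characters.
            inside_block = (i - 1) in blocked and i in blocked
            if j < m and not inside_block:
                dist[i][j + 1] = min(dist[i][j + 1], cur + 1)
            # User-defined string-to-string transformations.
            for src, dst, cost in transformations:
                if not src and not dst:
                    continue  # an empty rule would not advance
                if any(k in blocked for k in range(i, i + len(src))):
                    continue  # rule would rewrite a blocked character
                if query.startswith(src, i) and candidate.startswith(dst, j):
                    ti, tj = i + len(src), j + len(dst)
                    dist[ti][tj] = min(dist[ti][tj], cur + cost)
    return dist[n][m]


# Hypothetical rule: treat "ks" -> "x" as a cheap spelling variation.
rules = [("ks", "x", 0.2)]
print(generalized_edit_distance("taksi", "taxi"))                         # default operations only
print(generalized_edit_distance("taksi", "taxi", rules))                  # with the extra rule
print(generalized_edit_distance("taksi", "taxi", rules, blocked={2, 3}))  # "ks" is blocked
```

A search over a corpus would then keep only the word forms whose distance to the search word stays under a chosen threshold. Cheap custom rules let expected variants through, while blocked regions reject candidates that differ in the protected part of the word.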

2008

From Human Communication to Intelligent User Interfaces: Corpora of Spoken Estonian
Tiit Hennoste | Olga Gerassimenko | Riina Kasterpalu | Mare Koit | Andriela Rääbis | Krista Strandson
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We argue for the necessity of studying human-human spoken conversations of various kinds in order to create user interfaces to databases. An efficient user interface benefits from a well-organized corpus that can be used to investigate the strategies people use in conversation to be efficient and to handle problems in spoken communication. To model natural behaviour and test the model, we need a dialogue corpus in which the roles of the participants are close to the roles of the dialogue system and its user. For that reason, we collect and investigate the Corpus of Spoken Estonian and the Estonian Dialogue Corpus as sources for investigating human-human interaction. The transcription conventions and annotation typology of spoken human-human dialogues in Estonian are introduced. For creating a user interface, a corpus of one institutional conversation type is insufficient, since we need to know which phenomena are inherent to spoken language in general, which means are used only in certain types of conversation, and what the differences are.

2004

Other-Initiated Self-Repairs in Estonian Information Dialogues: Solving Communication Problems in Cooperation
Olga Gerassimenko | Tiit Hennoste | Mare Koit | Andriela Rääbis
Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004

2003
