Heete Sahkai
2026
Using LLMs to Extract Instances of Schematic Constructions from Unannotated L2 Learner Corpora
Jelena Kallas | Ahto Kiil | Heete Sahkai | Geda Paulsen | Kertu Saul
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Our previous study found that generative LLMs can be successfully used to identify instances of schematic constructions (as defined in Construction Grammar) in unannotated L1 corpus data. This study tests whether LLMs can also identify instances of constructions in unannotated L2 data. L2 learner corpora are notoriously difficult to annotate and query because they contain errors, so using LLMs can simplify the retrieval of construction data from them. The identification of instances of constructions in L2 learner data has many possible uses in pedagogical applications of Construction Grammar and constructicography, such as identifying error-prone constructions (or error-prone properties of constructions) and charting the distribution of constructional instances across CEFR levels. Using the Estonian Nominal Quantifier Construction as the example construction and an Estonian CEFR-graded learner corpus as the source of L2 data, we tested several prompts and several models (OpenAI’s o3-mini, o3, gpt-5-mini and gpt-5, Google DeepMind’s Gemini Flash 2.5, Anthropic’s Claude Sonnet 4.5 and Opus 4.1). We found that the best model, gpt-5, achieved F1-scores from 0.90 to 0.96, depending on the level of detail of the prompt.
2025
Estonian isolated-word text-to-speech synthesiser
Indrek Kiissel | Liisi Piits | Heete Sahkai | Indrek Hein | Liis Ermus | Meelis Mihkla
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)
This paper presents the development and evaluation of an Estonian isolated-word text-to-speech (TTS) synthesiser. Unlike conventional TTS systems that convert continuous text into speech, this system focuses on the synthesis of isolated words, which is crucial for applications such as pronunciation training, speech therapy, and (learners’) dictionaries. The system addresses two key challenges: generating natural prosody for isolated words and context-free disambiguation of homographs. We conducted a perception test to evaluate the performance of the TTS system in terms of pronunciation accuracy. We used 16 pairs of homographs that differ in palatalisation and 16 pairs of homographs that differ in quantity. Given that all the test items were correctly recognised by a majority of the evaluators, the performance of the synthesiser can be considered very good.
2022
Audiobook Dialogues as Training Data for Conversational Style Synthetic Voices
Liisi Piits | Hille Pajupuu | Heete Sahkai | Rene Altrov | Liis Ermus | Kairi Tamuri | Indrek Hein | Meelis Mihkla | Indrek Kiissel | Egert Männisalu | Kristjan Suluste | Jaan Pajupuu
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Synthetic voices are increasingly used in applications that require a conversational speaking style, raising the question as to which type of training data yields the most suitable speaking style for such applications. This study compares voices trained on three corpora of equal size recorded by the same speaker: an audiobook character speech (dialogue) corpus, an audiobook narrator speech corpus, and a neutral-style sentence-based corpus. The voices were trained with three text-to-speech synthesisers: two hidden Markov model-based synthesisers and a neural synthesiser. An evaluation study tested the suitability of their speaking style for use in customer service voice chatbots. Independently of the synthesiser used, the voices trained on the character speech corpus received the lowest, and those trained on the neutral-style corpus the highest scores. However, the evaluation results may have been confounded by the greater acoustic variability, less balanced sentence length distribution, and poorer phonemic coverage of the character speech corpus, especially compared to the neutral-style corpus. Therefore, the next step will be the creation of a more uniform, balanced, and representative audiobook dialogue corpus, and the evaluation of its suitability for further conversational-style applications besides customer service chatbots.