P Vijayalakshmi
2024
Utilizing POS-Driven Pitch Contour Analysis for Enhanced Tamil Text-to-Speech Synthesis
Preethi Thinakaran | Anushiya Rachel Gladston | P Vijayalakshmi | T Nagarajan | Malarvizhi Muthuramalingam | Sooriya S
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
This paper presents a novel approach to text-to-speech synthesis that integrates pitch contour labels selected by frequency-of-occurrence analysis for each Part-of-Speech (POS) tag. Grammatical tags are assigned to words using the Stanford POS Tagger, and the pitch contour labels that occur most frequently with each tag are analyzed using both unigram and bigram statistics. The primary goal is to identify the pitch contour for each POS tag based on its frequency of occurrence. These pitch contour labels are then imposed on the synthesized waveform using the TD-PSOLA (Time Domain Pitch Synchronous Overlap and Add) signal processing algorithm. The resulting waveform is evaluated using Mean Opinion Scores (MOS), which indicate a significant improvement in quality and prosodically rich synthetic speech.
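The frequency analysis described in this abstract can be illustrated with a minimal sketch. It assumes each word is already paired with a POS tag and a pitch contour label; the tag names (`NN`, `VB`), the contour labels (`rise`, `fall`, `flat`), and both function names are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

def most_frequent_contours(sentences):
    """Unigram statistics: map each POS tag to its most frequent contour label."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for pos, contour in sentence:
            counts[pos][contour] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def most_frequent_bigram_contours(sentences):
    """Bigram statistics: map each pair of adjacent POS tags to its most
    frequent pair of contour labels."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for (pos1, c1), (pos2, c2) in zip(sentence, sentence[1:]):
            counts[(pos1, pos2)][(c1, c2)] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}

# Hypothetical toy corpus: each sentence is a list of (POS tag, contour label).
corpus = [
    [("NN", "rise"), ("VB", "fall")],
    [("NN", "rise"), ("VB", "fall")],
    [("NN", "flat"), ("VB", "fall")],
]
unigram_table = most_frequent_contours(corpus)
bigram_table = most_frequent_bigram_contours(corpus)
```

In a pipeline of this kind, the resulting tables would supply the target contour for each tagged word before pitch modification; how ties and unseen tags are handled is left open here, since the abstract does not say.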
Chirp Group Delay based Feature for Speech Applications
Malarvizhi Muthuramalingam | Anushiya Rachel Gladston | P Vijayalakshmi | T Nagarajan
Proceedings of the 21st International Conference on Natural Language Processing (ICON)
The conventional Fast Fourier Transform (FFT), computed on the unit circle, gives an accurate representation of the spectrum if the signal under consideration consists of sustained oscillations. However, practical signals are not sustained oscillations. For signals that decay or grow over time, the phase spectrum computed using the conventional FFT is not accurate, and consequently neither is the magnitude spectrum. Hence a feature based on a variant of the group delay spectrum, namely the chirp group delay (CGD) spectrum, is proposed. The efficacy of the proposed feature is evaluated in Gaussian Mixture Model (GMM) and Convolutional Neural Network (CNN)-based speaker identification systems. Analysis reveals a significant increase in performance when using the CGD-based feature over the magnitude spectrum.
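The abstract does not give the exact feature definition, so the following is only a minimal sketch of the general idea. It assumes the usual formulation of chirp group delay as the group delay of the z-transform evaluated on a circle of radius `rho` rather than on the unit circle. The function name, the default `rho`, and the number of frequency bins are illustrative assumptions, and a plain DFT sum is used instead of an FFT to keep the steps visible.

```python
import cmath
import math

def chirp_group_delay(x, rho=1.05, n_bins=64):
    """Group delay of x evaluated on the circle |z| = rho.

    With rho = 1 this reduces to the ordinary group delay on the unit circle.
    """
    # Scaling x[n] by rho**(-n) moves the evaluation contour from |z| = 1 to |z| = rho.
    scaled = [v * rho ** (-n) for n, v in enumerate(x)]
    delays = []
    for k in range(n_bins):
        omega = math.pi * k / n_bins  # frequencies in [0, pi)
        # X is the transform of the scaled signal, Y the transform of n * scaled[n].
        X = sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(scaled))
        Y = sum(n * v * cmath.exp(-1j * omega * n) for n, v in enumerate(scaled))
        power = max(abs(X) ** 2, 1e-12)  # guard against division by zero
        # Group delay = -d(phase)/d(omega) = Re(Y * conj(X)) / |X|^2
        delays.append((X.real * Y.real + X.imag * Y.imag) / power)
    return delays

# Sanity check: an impulse delayed by 3 samples has a group delay of 3 at every frequency.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0]
cgd = chirp_group_delay(impulse)
```

A practical front end would apply this frame by frame to windowed speech and use an FFT for speed; the choice of `rho` and any post-processing of the resulting spectrum are design decisions this sketch does not settle.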