Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion
Abstract:
The creation of artificial polyglot voices remains a challenging task, despite considerable progress in recent years.
This paper investigates self-supervised learning for voice conversion to create native-sounding polyglot voices. We
introduce a novel cross-lingual any-to-one voice conversion system that is able to preserve the source accent without
the need for multilingual data from the target speaker. In addition, we present a novel cross-lingual fine-tuning strategy
that further improves the accent and reduces the training data requirements. Objective and subjective evaluations with
English, Spanish, French and Mandarin Chinese confirm that our approach improves on state-of-the-art methods, enhancing
the speech intelligibility and overall quality of the converted speech, especially in cross-lingual scenarios.
Intra-lingual - English
In this section, we present some speech samples used in the intra-lingual subjective evaluation.
We focus on any-to-one conversion, using LJSpeech as the target and LibriSpeech test-clean as the source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.
Target speaker:
Text | Source | Soft-VC | kNN-VC | Proposed | Proposed-F
AY AND SHOW YOU SOME PRETTY TRICKS
THE DEPARTING LADIES WHO HAD SAID THEY WOULD STAY DIDN'T OF COURSE THANK HEAVEN STAY THEY DEPARTED IN CONSEQUENCE OF ARRANGEMENTS MADE IN A RAGE OF CURIOSITY AS THEY PROFESSED PRODUCED BY THE TOUCHES WITH WHICH HE HAD ALREADY WORKED US UP
AND I DECLARE IT'S TOO BAD THAT IT IS
WHY FADES THE LOTUS OF THE WATER
Cross-lingual - French
In this section, we present some cross-lingual speech samples for French.
We use LJSpeech as the target and the Multilingual LibriSpeech (MLS) French dev and test sets as the source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.
Target speaker:
Text | Source | Soft-VC | kNN-VC | Proposed | Proposed-F
le sultan de son côté témoigna de l'impatience d'apprendre quel démêlé le génie avait eu avec salomon c'est pourquoi scheherazade poursuivit ainsi le conte du pêcheur
la paix ne fit que paraître la guerre recommença aussitôt par le dessein qu'eut le roi de faire arrêter à noyers le prince de condé et l'amiral de châtillon et ce dessein ayant été découvert l'on commença de nouveau les préparatifs de la guerre et le prince de monpensier fut contraint de quitter sa femme pour se rendre où son devoir l'appelait
un vol d'oiseau nocturne au travers des charpentes d'un clocher mais il se calma mécontent de lui est-ce qu'on ne pouvait faire les choses froidement
elle resta tristement au bord du toit d'où elle vit s'éloigner sa famille et elle serait certainement morte de faim de froid et de chagrin si les enfants de la maison ne l'avaient recueillie
Cross-lingual - Spanish
In this section, we present some cross-lingual speech samples for Spanish.
We use LJSpeech as the target and the Multilingual LibriSpeech (MLS) Spanish test set as the source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.
Target speaker:
Text | Source | Soft-VC | kNN-VC | Proposed | Proposed-F
esta gentil moza pues ayudó a la doncella y las dos hicieron una muy mala cama a don quijote en un camaranchón que en otros tiempos daba manifiestos indicios que había servido de pajar muchos años
con todo esto si yo no quiero dormir y estarme despierto toda la noche sin pegar pestaña será vuestra merced bastante con todo su poder para hacerme dormir si yo no quiero
unos dicen que lo compuso homero el poeta ciego de la barba de rizos que andaba de pueblo en pueblo cantando sus versos al compás de la lira como hacían los aedas de entonces
ni gracia bastante para hacer de aquel pasado en el que aparecían enlazados ana y él un cuadro de contemplación peligrosa y es que ana pensaba con la señora de rusell que un matrimonio más adecuado hubiérale mejorado notablemente
Cross-lingual - Mandarin Chinese
In this section, we present some cross-lingual speech samples for Mandarin Chinese.
We use LJSpeech as the target and the Aishell dev and test sets as the source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.
Target speaker:
Text | Source | Soft-VC | kNN-VC | Proposed | Proposed-F
再 加 上 运用 金融 市场 的 手段
发 改 委 项目 申报 流程
强化 农民 专业 合作社 组织 带动 能力
加快 建设 环境 监测 预警 体系
Ablation - Emotions Maintenance
In this section, we present some intra-lingual speech samples used in the ablation evaluation.
We use LJSpeech as the target and the Emotional Speech Dataset (ESD) as the source speech.
We compare Proposed and Proposed-F on four emotions: angry, happy, sad, and surprise.
Target speaker:
Emotion | Text | Source | Proposed | Proposed-F
Angry | At the roots-of a bush of a grass.
Angry | Why it is just like the round egg which sounds thin.