Enhancing Polyglot Voices by Leveraging Cross-Lingual Fine-Tuning in Any-to-One Voice Conversion

Abstract: The creation of artificial polyglot voices remains a challenging task, despite considerable progress in recent years. This paper investigates self-supervised learning for voice conversion to create native-sounding polyglot voices. We introduce a novel cross-lingual any-to-one voice conversion system that preserves the source accent without requiring multilingual data from the target speaker. In addition, we propose a novel cross-lingual fine-tuning strategy that further improves the accent and reduces the training data requirements. Objective and subjective evaluations with English, Spanish, French, and Mandarin Chinese confirm that our approach improves on state-of-the-art methods, enhancing the speech intelligibility and overall quality of the converted speech, especially in cross-lingual scenarios.

Intra-lingual - English

In this section, we present some speech samples used in the intra-lingual subjective evaluation.
We focus on any-to-one conversion using LJSpeech as the target and LibriSpeech test-clean as the source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.

Target speaker: LJSpeech
Text Source Soft-VC kNN-VC Proposed Proposed-F
AY AND SHOW YOU SOME PRETTY TRICKS
THE DEPARTING LADIES WHO HAD SAID THEY WOULD STAY DIDN'T OF COURSE THANK HEAVEN STAY THEY DEPARTED IN CONSEQUENCE OF ARRANGEMENTS MADE IN A RAGE OF CURIOSITY AS THEY PROFESSED PRODUCED BY THE TOUCHES WITH WHICH HE HAD ALREADY WORKED US UP
AND I DECLARE IT'S TOO BAD THAT IT IS
WHY FADES THE LOTUS OF THE WATER

Cross-lingual - French

In this section, we present some cross-lingual speech samples for French.
We use LJSpeech as the target and Multilingual LibriSpeech (MLS) French dev + test as source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.

Target speaker: LJSpeech
Text Source Soft-VC kNN-VC Proposed Proposed-F
le sultan de son côté témoigna de l'impatience d'apprendre quel démêlé le génie avait eu avec salomon c'est pourquoi scheherazade poursuivit ainsi le conte du pêcheur
la paix ne fit que paraître la guerre recommença aussitôt par le dessein qu'eut le roi de faire arrêter à noyers le prince de condé et l'amiral de châtillon et ce dessein ayant été découvert l'on commença de nouveau les préparatifs de la guerre et le prince de monpensier fut contraint de quitter sa femme pour se rendre où son devoir l'appelait
un vol d'oiseau nocturne au travers des charpentes d'un clocher mais il se calma mécontent de lui est-ce qu'on ne pouvait faire les choses froidement
elle resta tristement au bord du toit d'où elle vit s'éloigner sa famille et elle serait certainement morte de faim de froid et de chagrin si les enfants de la maison ne l'avaient recueillie

Cross-lingual - Spanish

In this section, we present some cross-lingual speech samples for Spanish.
We use LJSpeech as the target and Multilingual LibriSpeech (MLS) Spanish test as source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.

Target speaker: LJSpeech
Text Source Soft-VC kNN-VC Proposed Proposed-F
esta gentil moza pues ayudó a la doncella y las dos hicieron una muy mala cama a don quijote en un camaranchón que en otros tiempos daba manifiestos indicios que había servido de pajar muchos años
con todo esto si yo no quiero dormir y estarme despierto toda la noche sin pegar pestaña será vuestra merced bastante con todo su poder para hacerme dormir si yo no quiero
unos dicen que lo compuso homero el poeta ciego de la barba de rizos que andaba de pueblo en pueblo cantando sus versos al compás de la lira como hacían los aedas de entonces
ni gracia bastante para hacer de aquel pasado en el que aparecían enlazados ana y él un cuadro de contemplación peligrosa y es que ana pensaba con la señora de rusell que un matrimonio más adecuado hubiérale mejorado notablemente

Cross-lingual - Mandarin Chinese

In this section, we present some cross-lingual speech samples for Mandarin Chinese.
We use LJSpeech as the target and Aishell dev + test as source speech.
We compare Proposed and Proposed-F against two baselines: Soft-VC and kNN-VC.

Target speaker: LJSpeech
Text Source Soft-VC kNN-VC Proposed Proposed-F
再 加 上 运用 金融 市场 的 手段
发 改 委 项目 申报 流程
强化 农民 专业 合作社 组织 带动 能力
加快 建设 环境 监测 预警 体系

Ablation - Emotion Preservation

In this section, we present some intra-lingual speech samples used in the ablation evaluation.
We use LJSpeech as the target and Emotional Speech Dataset (ESD) as source speech.
We compare Proposed and Proposed-F on four emotions: angry, happy, sad, and surprise.

Target speaker: LJSpeech
Emotion Text Source Proposed Proposed-F
Angry At the roots-of a bush of a grass.
Angry Why it is just like the round egg which sounds thin.
Angry This speech roused dame Ilse to anger.
Angry Our King George is labourers.
Angry And there you'll find a snap dragon fly.
Happy Confess you opened the thirteenth door.
Happy It says no way! shouted Daisy.
Happy And they were sandy yellow brownish all over.
Happy But one requires the explorer to furnish proofs.
Happy We expected Tom would jump for joy.
Sad From August eighteenth, of their divorce.
Sad Wake now my merry tads!
Sad Clear than clear water!
Sad Mister Lawson saw George last night.
Sad Take courage all isn't lost yet.
Surprise The fisherman and his wife see George every day.
Surprise However, somebody killed something.
Surprise Story twenty nine a boy and a monkey.
Surprise Andy what's the gyre and to gimble.
Surprise Come on my jack in the boxes!