We present sound examples for the experiments in our paper Expressive Neural Voice Cloning. We clone voices for speakers in the VCTK dataset for three tasks: Text (synthesizing speech directly from text for a new speaker), Imitation (reconstructing a sample of the target speaker from its factorized style and speaker information), and Style Transfer (transferring the pitch and rhythm of audio from an expressive speaker to the voice of the target speaker). All examples on this website are generated using 10 target speaker samples for voice cloning.
For the imitation task, we use a text and audio pair of the target speaker (not used for voice cloning) and try to reconstruct the audio from its factorized representation using our synthesis model. All of the style conditioning variables (pitch, rhythm, and GST embedding) are derived from the speech sample we are trying to imitate.
|Ground-Truth||Proposed Model: Zero-shot||Proposed Model: Adaptation Whole||Proposed Model: Adaptation Decoder|
The goal of this task is to transfer the pitch and rhythm from expressive speech to the cloned speech of the target speaker. As style references, we use examples from the single-speaker Blizzard 2013 dataset, which contains expressive audiobook readings with high variation in emotion and pitch. For our proposed model, we extract the pitch and rhythm from this style reference audio. In order to retain speaker-specific latent style aspects, we use target speaker samples to extract the GST embedding. For the Tacotron2 + GST model, which does not have explicit pitch conditioning, we use the style reference audio to obtain both the GST embedding and the rhythm.
|Style Reference Audio||Target Speaker Audio||Proposed Model: Zero-shot||Proposed Model: Adaptation Whole||Proposed Model: Adaptation Decoder||Tacotron2 + GST: Zero-shot (baseline)|
For cloning speech directly from text, we first synthesize speech for the given text using a single-speaker TTS model (Tacotron 2 + WaveGlow). We then derive the pitch contour of the synthetic speech using the Yin algorithm and scale the pitch contour linearly to have the same mean pitch as that of the target speaker. To derive the rhythm, we use our proposed synthesis model as a forced aligner between the text and the Tacotron2-synthesized speech. We use the target speaker samples to obtain the GST embedding for both our proposed model and the baseline Tacotron2 + GST model.
|Target Speaker Audio||Text||Proposed Model: Zero-shot||Proposed Model: Adaptation Whole||Proposed Model: Adaptation Decoder||Tacotron2 + GST: Zero-shot (baseline)|
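The linear pitch scaling used in the text task can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for these samples: the function name `scale_pitch_to_target_mean` is hypothetical, both contours are assumed to be already extracted (for example with Yin) as per-frame F0 values in Hz, 0.0 is assumed to mark an unvoiced frame, the mean is assumed to be taken over voiced frames only, and the reference contour is assumed to come from the target speaker samples.

```python
def scale_pitch_to_target_mean(source_f0, target_f0):
    """Scale source_f0 so its mean voiced pitch matches that of target_f0.

    Both arguments are lists of per-frame F0 values in Hz. A value of 0.0
    marks an unvoiced frame (an assumption of this sketch).
    """
    # Average only over voiced frames so silence does not bias the mean.
    source_voiced = [f for f in source_f0 if f > 0.0]
    target_voiced = [f for f in target_f0 if f > 0.0]
    if not source_voiced or not target_voiced:
        raise ValueError("both contours need at least one voiced frame")

    source_mean = sum(source_voiced) / len(source_voiced)
    target_mean = sum(target_voiced) / len(target_voiced)
    ratio = target_mean / source_mean

    # Multiplying by a constant keeps unvoiced frames at 0.0 and preserves
    # the relative shape of the contour.
    return [f * ratio for f in source_f0]


# Toy contours: the synthetic speech averages 150 Hz over voiced frames,
# the target speaker 225 Hz, so every frame is multiplied by 1.5.
synthetic_f0 = [0.0, 100.0, 200.0, 0.0, 150.0]
target_f0 = [200.0, 0.0, 250.0]
print(scale_pitch_to_target_mean(synthetic_f0, target_f0))
# → [0.0, 150.0, 300.0, 0.0, 225.0]
```

Scaling by a single ratio shifts the overall pitch level toward the target speaker while leaving the rise and fall of the synthetic contour unchanged.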