Expressive Neural Voice Cloning - Audio Examples

*Paarth Neekhara, *Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar, Julian McAuley
University of California San Diego
* Equal contribution

ACML 2021

[paper] [demo]

We present sound examples for the experiments in our paper Expressive Neural Voice Cloning. We clone voices for speakers in the VCTK dataset for three tasks:

Text - Synthesizing speech directly from text for a new speaker.
Imitation - Reconstructing a sample of the target speaker from its factorized style and speaker information.
Style Transfer - Transferring the pitch and rhythm of audio from an expressive speaker to the voice of the target speaker.

All examples on this website are generated using 10 target speaker samples for voice cloning.

Imitation Task

For the imitation task, we use a text and audio pair of the target speaker (not used for voice cloning) and try to reconstruct the audio from its factorized representation using our synthesis model. All of the style conditioning variables (pitch, rhythm, and the GST embedding) are derived from the speech sample we are trying to imitate.
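
To make this concrete, here is a minimal sketch of how the conditioning inputs for this task could be assembled. It is not the released implementation: gst_encoder and force_align are hypothetical stand-ins for the model's GST reference encoder and attention-based forced aligner, and the Yin-based pitch extraction via librosa is an assumption.

```python
import librosa

def imitation_conditioning(text, wav_path, gst_encoder, force_align,
                           sr=22050, fmin=80.0, fmax=600.0):
    """Derive all style conditioning variables from the sample to imitate."""
    audio, _ = librosa.load(wav_path, sr=sr)
    pitch = librosa.yin(audio, fmin=fmin, fmax=fmax, sr=sr)  # frame-level F0
    rhythm = force_align(text, audio)  # per-symbol durations from attention
    gst = gst_encoder(audio)           # global style token embedding
    # For imitation, every conditioning signal comes from the same sample.
    return {"text": text, "pitch": pitch, "rhythm": rhythm, "gst": gst}
```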

[Audio sample table with columns: Ground-Truth | Proposed Model: Zero-shot | Proposed Model: Adaptation Whole | Proposed Model: Adaptation Decoder]

Style Transfer Task

The goal of this task is to transfer the pitch and rhythm of some expressive speech to the cloned speech of the target speaker. For this task, we use examples from the single-speaker Blizzard 2013 dataset as style references. This dataset contains expressive audiobook readings from a single speaker with high variation in emotion and pitch. For our proposed model, we use this style reference audio to extract the pitch and rhythm. In order to retain speaker-specific latent style aspects, we use target speaker samples to extract the GST embedding. For the Tacotron2 + GST model, which does not have explicit pitch conditioning, we use the style reference audio to obtain both the GST embedding and the rhythm.
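
Relative to the imitation sketch above, only the source of each conditioning signal changes. A minimal variant, reusing the same hypothetical gst_encoder and force_align helpers (file names and transcript below are illustrative only):

```python
import librosa

# Assumes the hypothetical gst_encoder / force_align helpers from the
# imitation sketch above; file names and transcript are placeholders.
text = "transcript of the expressive reference utterance"
style_audio, _ = librosa.load("blizzard_reference.wav", sr=22050)
target_audio, _ = librosa.load("target_speaker_sample.wav", sr=22050)

conditioning = {
    "text": text,
    # Pitch and rhythm are taken from the expressive style reference...
    "pitch": librosa.yin(style_audio, fmin=80.0, fmax=600.0, sr=22050),
    "rhythm": force_align(text, style_audio),
    # ...while the GST embedding comes from a target speaker sample, so
    # speaker-specific latent style aspects are retained.
    "gst": gst_encoder(target_audio),
}
```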

[Audio sample table with columns: Style Reference Audio | Target Speaker Audio | Proposed Model: Zero-shot | Proposed Model: Adaptation Whole | Proposed Model: Adaptation Decoder | Tacotron2 + GST: Zero-shot (baseline)]

Text Task

For cloning speech directly from text, we first synthesize speech for the given text using a single-speaker TTS model (Tacotron 2 + WaveGlow). We then derive the pitch contour of the synthetic speech using the Yin algorithm and scale the pitch contour linearly to have the same mean pitch as that of the target speaker's samples. To derive rhythm, we use our proposed synthesis model as a forced aligner between the text and the Tacotron2-synthesized speech. We use the target speaker samples to obtain the GST embedding for both our proposed model and the baseline Tacotron2 + GST model.
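
The linear pitch rescaling step is simple arithmetic: multiply the synthetic contour by the ratio of mean pitches. Below is a sketch under the assumption that librosa's Yin implementation stands in for the paper's pitch extractor; the voicing mask is a crude heuristic, not the paper's exact preprocessing.

```python
import numpy as np
import librosa

def rescale_pitch(tts_audio, target_audio, sr=22050, fmin=80.0, fmax=600.0):
    """Scale the TTS pitch contour so its mean matches the target speaker's."""
    f0_tts = librosa.yin(tts_audio, fmin=fmin, fmax=fmax, sr=sr)
    f0_tgt = librosa.yin(target_audio, fmin=fmin, fmax=fmax, sr=sr)
    # Average over plausibly voiced frames only; Yin estimates pinned at the
    # search bounds are usually unvoiced artifacts (heuristic assumption).
    voiced_tts = f0_tts[(f0_tts > fmin) & (f0_tts < fmax)]
    voiced_tgt = f0_tgt[(f0_tgt > fmin) & (f0_tgt < fmax)]
    return f0_tts * (np.mean(voiced_tgt) / np.mean(voiced_tts))
```

Here target_audio could simply be the concatenation of the target speaker's cloning samples.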

[Audio sample table with columns: Target Speaker Audio | Text | Proposed Model: Zero-shot | Proposed Model: Adaptation Whole | Proposed Model: Adaptation Decoder | Tacotron2 + GST: Zero-shot (baseline)]