YourTTS voice conversion for mortals #1084
jreus asked this question in General Q&A
-
Hey @jreus |
-
The structure is generated without predetermining the device |
-
Dear coqui/TTS team, thank you for your fantastic work on this project. :-) My expertise with TTS and deep learning is rather modest, so please excuse me if I am asking ignorant questions ...
I'm working on creating custom-voice TTS models using coqui. Until now, it has seemed like the best approach is to create a custom labeled dataset (text + speech snippets) of around ~1 hour, and then fine-tune a model pre-trained on a voice similar to the one you are aiming to create (maybe I'm wrong about this).
I really like the prosody quality of Tacotron2 with GST, so that has been the model I have studied the most. But now, with the release of coqui 0.5.0, we have the pre-trained multi-speaker/multi-lingual YourTTS model, which can apparently be fine-tuned with much less training data and even do zero-shot voice conversion. Amazing!
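For concreteness, this is the kind of zero-shot cloning I mean. Newer TTS releases have a high-level wrapper that makes it a few lines (I'm not certain this exact TTS.api interface exists in 0.5.0, and the reference-clip path below is just a placeholder):

```python
from TTS.api import TTS

# Load the released multilingual YourTTS checkpoint (downloaded on first use).
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot cloning: condition the synthesis on a short clip of the target voice.
tts.tts_to_file(
    text="This sentence should come out in the reference speaker's voice.",
    speaker_wav="reference_voice.wav",  # placeholder: a short clip of the target speaker
    language="en",
    file_path="cloned_output.wav",
)
```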
However, I am having great difficulty replicating the workflow from the YourTTS Colab demos in the latest coqui 0.5.0.
I've tried my best to get something similar working in a basic script using the pre-trained YourTTS model, yet I keep getting the mysterious error below. Even after looking into TTS/tts/models/vits.py, I'm still struggling to understand why it complains that this is not a multi-speaker model. Any thoughts/ideas on how to proceed?
Here's my script... which hopefully can serve as a reference if I can get it working.
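In outline it does something like the following (a minimal sketch rather than the full file; the file paths are placeholders and the Synthesizer keyword arguments are my own reading of TTS/utils/synthesizer.py, so they may themselves be part of the problem):

```python
from TTS.utils.manage import ModelManager
from TTS.utils.synthesizer import Synthesizer

# Download the released YourTTS checkpoint and config (cached after the first run).
manager = ModelManager()
model_path, config_path, model_item = manager.download_model(
    "tts_models/multilingual/multi-dataset/your_tts"
)

# Build a synthesizer around the multi-speaker / multi-lingual checkpoint.
# NOTE: perhaps tts_speakers_file / tts_languages_file also need to be passed
# explicitly here -- this is one of the things I'm unsure about.
synthesizer = Synthesizer(
    tts_checkpoint=model_path,
    tts_config_path=config_path,
    use_cuda=False,
)

# Zero-shot synthesis conditioned on a reference clip of the target voice.
wav = synthesizer.tts(
    text="Testing voice cloning with YourTTS.",
    speaker_wav="my_target_voice.wav",  # placeholder reference clip
    language_name="en",
)
synthesizer.save_wav(wav, "yourtts_test.wav")
```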