VITS, Russian, 55k model poor result #1998
-
Update after 100k steps: the result is exactly the same. Not sure where to start investigating, but I will train a new model.
-
I'm pretty sure it's a phonemizer problem. By default they use Gruut, which has relatively poor performance in Russian. I'd suggest turning off phonemization for Russian but keeping the normalization part, to expand all numbers and dates and keep all characters in lowercase.
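Something like this is what I mean, as a minimal sketch (the fields are from Coqui's VitsConfig; note that the built-in cleaners do not expand Russian numbers or dates, so that part would need a custom cleaner of your own):

```python
from TTS.tts.configs.vits_config import VitsConfig

# Skip the phonemizer entirely and train on raw characters.
# "multilingual_cleaners" lowercases and collapses whitespace, but it does NOT
# expand Russian numbers/dates -- that normalization needs a custom cleaner.
config = VitsConfig(
    use_phonemes=False,
    text_cleaner="multilingual_cleaners",
)
```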
-
Hey, have you had any success with this? I'm trying to do the same right now.
-
I fine-tuned on RUSLAN starting from the VITS model trained on LJSpeech; here is the result: ruslan_110k.mp4. It doesn't sound great: there are weird pauses, and overall it is still much worse than the LJSpeech model (in my subjective view). What do you think could be improved? Maybe I should try Tacotron + HiFi-GAN? Also, I'm training it along with another private dataset (using speaker embeddings) that has only 10 minutes of data, which is definitely not enough. What is the minimum number of minutes that should be enough for VITS?
-
What were the final results? Did you manage to achieve high-quality speech? I'm doing the same now, so we could help each other :)
-
Hello, I want to ask for advice on what to do with my model. This is only my second attempt at training a DNN, so I will appreciate any help. My goal is to build a voice cloning pipeline, but for now I will be happy just to train something acceptable for pre-production. I feel bad for bothering you with such basic questions, and I will be grateful for any information and tips related to my case. Because training happens in a free cloud with a GPU quota, I really want to understand as much as possible before my quota resets.
What I have today is a VITS model trained on Kaggle over 4 sessions to ~55k+ steps on the "ruslan" (Russian) dataset, resampled to 22050 Hz. Other people were getting somewhat decent results with 50k-step VITS, while mine has only just started to sound like a human instead of space-whale noise. Should I just train more? My graphs don't show a clear trend, and intuitively I feel like restarting training completely with a different config.
Kaggle links:
kaggle notebook
kaggle dataset
I am attaching 3 audios from the TensorBoard "Test Audios" tab.
text: "Я убежден, что мы с Гоголем обладаем равными авторскими правами."
My best result so far, yet it sounds like a very, very drunk man who has trouble speaking.
1.mp4
This one sounds like generic Eastern European words, but it is not understandable at all.
2.mp4
Same as the previous audio, not understandable at all.
3.mp4
eval.mp4
I also think all 3 of my "eval spectrogram" images should look more like vertical lines and not so rectangular.
I would appreciate a link to an explanation, or an explanation in your own words, of how to debug using these graphs.
I will attach my config file, and you can find the code in the Kaggle notebook linked above. Here I will paste a few parameters from my config that I suspect are the reason for the poor result:
text_cleaner="multilingual_cleaners"
I saw people using phoneme_cleaners; can that affect quality with this dataset? (It's pretty clean.)
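One way I could compare the two cleaners on a sample line (both functions exist in TTS.tts.utils.text.cleaners, as far as I can tell; comments reflect my reading of the code, please correct me if wrong):

```python
from TTS.tts.utils.text.cleaners import multilingual_cleaners, phoneme_cleaners

sample = "Привет, МИР!"
print(multilingual_cleaners(sample))  # lowercases and collapses whitespace
print(phoneme_cleaners(sample))       # keeps case; expands English numbers/abbreviations only
```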
use_phonemes=True
I saw comments that VITS does not use phonemes, and comments that enabling phonemes leads to better results. I am confused, because I thought the tokenizer would translate words to tokens (words to phonemes) anyway; I don't understand how phonemes are integrated into the model pipeline.
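For reference, here is my current understanding of the text pipeline, as a rough sketch (class and method names are from the Coqui TTS repo; check your installed version):

```python
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer

config = VitsConfig(use_phonemes=True, phoneme_language="ru",
                    text_cleaner="phoneme_cleaners")
tokenizer, config = TTSTokenizer.init_from_config(config)

# use_phonemes=True:  text -> cleaner -> phonemizer (Gruut by default) -> IPA string -> IDs
# use_phonemes=False: text -> cleaner -> raw characters -> IDs
ids = tokenizer.text_to_ids("пример текста")
```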
mixed_precision=True
I am pretty solid in understanding low-level hardware code and what this actually does. I expect this to affect the quality of the result, but can it completely "break" training?
I found characters to be the most confusing part of the config. Do I need them? Russian is in eSpeak and Gruut, yet removing "CharactersConfig" completely from the config causes errors.
I also tried to run "find_unique_chars.py" and paste the result into "characters", but I got an exception.
While investigating "VitsCharacters" I found that some variables like "is_unique" and "eos/bos/blank" are not passed down the pipeline and are overridden by static values, which means I can completely ignore those variables; is that true?
Should I even use the "VitsCharacters" class? Some people use the generic characters class for their VITS training. I'll sketch below what I think a correct config would look like.
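Something like this is what I am aiming for (the field names are from Coqui's CharactersConfig; the character set shown is hypothetical and should really come from find_unique_chars.py run over the dataset):

```python
from TTS.tts.configs.shared_configs import CharactersConfig

# Hypothetical raw-character setup for Russian; replace "characters" with the
# actual set reported by find_unique_chars.py for your data.
characters = CharactersConfig(
    characters_class="TTS.tts.models.vits.VitsCharacters",
    characters="абвгдеёжзийклмнопрстуфхцчшщъыьэюя",
    punctuations="!,.? -",
    pad="<PAD>",
    eos="<EOS>",
    bos="<BOS>",
    blank="<BLNK>",
)
```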
add_blank=False
I set this to get rid of some "character not found, skipping" warnings in the logs; those messages were filled with lots of <BLNK> symbols, which was super annoying. I have heard this can affect quality, and I want to turn it back on once I fix the characters.
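As far as I understand it, add_blank interleaves a blank token between character IDs before the text encoder. A conceptual sketch (not the actual Coqui implementation):

```python
# Interleave a <BLNK> id between every character id, as add_blank does conceptually.
ids = [5, 12, 7]   # hypothetical character ids
blank_id = 0
interleaved = [blank_id]
for i in ids:
    interleaved += [i, blank_id]
print(interleaved)  # [0, 5, 0, 12, 0, 7, 0]
```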
I am also attaching my full config: config.txt
Thank you for reading. Like I said at the beginning, any tip is very welcome.