VITS, Russian, 55k model poor result #1998
-
Update after 100k steps: the result is exactly the same. Not sure where to start investigating, but I will train a new model.
-
I'm pretty sure it's a phonemizer problem. By default they use Gruut, which has relatively poor performance in Russian. I'd suggest turning off phonemization for Russian but keeping the normalization part, to expand all numbers and dates and keep all characters in lowercase.
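Something like this is what I mean, as a minimal sketch (the fields are from Coqui's VitsConfig; note that the built-in cleaners do not expand Russian numbers or dates, so that part would need a custom cleaner of your own):

```python
from TTS.tts.configs.vits_config import VitsConfig

# Skip the phonemizer entirely and train on raw characters.
# "multilingual_cleaners" lowercases and collapses whitespace, but it does NOT
# expand Russian numbers/dates -- that normalization needs a custom cleaner.
config = VitsConfig(
    use_phonemes=False,
    text_cleaner="multilingual_cleaners",
)
```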
-
Hey, have you had any success with this? I'm trying to do the same right now.
-
I fine-tuned on RUSLAN starting from the VITS model trained on LJSpeech; here is the result: ruslan_110k.mp4. It doesn't sound great: there are weird pauses, and overall it is still much worse than the LJSpeech model (in my subjective view). What do you think could be improved? Maybe I should try Tacotron + HiFi-GAN? Also, I'm training it along with another private dataset (using speaker embeddings) that has only 10 minutes of data, which is definitely not enough. What is the minimum number of minutes that should be enough for VITS?
-
What were the final results? Did you manage to achieve high-quality speech? I'm doing the same now, so we could help each other :)
-
Hello, I want to ask for advice on what to do with my model. This is only my second attempt at training a DNN, so I will appreciate any help. My goal is to build a voice cloning pipeline, but for now I will be happy just to train something acceptable for pre-production. I feel bad for bothering you with such basic questions, and I will be grateful for any information and tips related to my case. Because training happens in a free cloud with a GPU quota, I really want to understand as much as possible before my quota resets.
What I have today is a VITS model trained on Kaggle over 4 sessions to ~55k+ steps on the "ruslan" (Russian) dataset, resampled to 22050 Hz. Other people were getting somewhat decent results with 50k-step VITS, while mine has only just started to sound like a human instead of space-whale noise. Should I just train more? My graphs don't show a clear trend, and intuitively I feel like restarting training completely with a different config.
Kaggle links:
kaggle notebook
kaggle dataset
I am attaching 3 audios from the TensorBoard "Test Audios" tab.
text: "Я убежден, что мы с Гоголем обладаем равными авторскими правами."
My best result so far, yet it sounds like a very, very drunk man who has trouble speaking.
1.mp4
This one sounds like generic Eastern European words, but it is not understandable at all.
2.mp4
Same as the previous audio, not understandable at all.
3.mp4
eval.mp4
I also think all 3 of my "eval spectrogram" images should look more like vertical lines and not so rectangular.
I would appreciate a link to an explanation, or an explanation in your own words, of how to debug using these graphs.
I will attach my config file, and you can find the code in the Kaggle notebook linked above. Here I will paste a few parameters from my config that I suspect are the reason for the poor result:
text_cleaner="multilingual_cleaners"
I saw people using phoneme_cleaners; can that affect quality with this dataset? (It's pretty clean.)
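One way I could compare the two cleaners on a sample line (both functions exist in TTS.tts.utils.text.cleaners, as far as I can tell; comments reflect my reading of the code, please correct me if wrong):

```python
from TTS.tts.utils.text.cleaners import multilingual_cleaners, phoneme_cleaners

sample = "Привет, МИР!"
print(multilingual_cleaners(sample))  # lowercases and collapses whitespace
print(phoneme_cleaners(sample))       # keeps case; expands English numbers/abbreviations only
```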
use_phonemes=True
I saw comments that VITS does not use phonemes, and comments that enabling phonemes leads to better results. I am confused, because I thought the tokenizer would translate words to tokens (words to phonemes) anyway; I don't understand how phonemes are integrated into the model pipeline.
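For reference, here is my current understanding of the text pipeline, as a rough sketch (class and method names are from the Coqui TTS repo; check your installed version):

```python
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.utils.text.tokenizer import TTSTokenizer

config = VitsConfig(use_phonemes=True, phoneme_language="ru",
                    text_cleaner="phoneme_cleaners")
tokenizer, config = TTSTokenizer.init_from_config(config)

# use_phonemes=True:  text -> cleaner -> phonemizer (Gruut by default) -> IPA string -> IDs
# use_phonemes=False: text -> cleaner -> raw characters -> IDs
ids = tokenizer.text_to_ids("пример текста")
```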
mixed_precision=True
I am pretty solid in understanding low-level hardware code and what this actually does. I expect this to affect the quality of the result, but can it completely "break" training?
I found characters to be the most confusing part of the config. Do I need them? Russian is in eSpeak and Gruut, yet removing "CharactersConfig" completely from the config causes errors.
I also tried to run "find_unique_chars.py" and paste the result into "characters", but I got an exception.
While investigating "VitsCharacters" I found that some variables like "is_unique" and "eos/bos/blank" are not passed down the pipeline and are overridden by static values, which means I can completely ignore those variables; is that true?
Should I even use the "VitsCharacters" class? Some people use the generic characters class for their VITS training. I'll sketch below what I think a correct config would look like.
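Something like this is what I am aiming for (the field names are from Coqui's CharactersConfig; the character set shown is hypothetical and should really come from find_unique_chars.py run over the dataset):

```python
from TTS.tts.configs.shared_configs import CharactersConfig

# Hypothetical raw-character setup for Russian; replace "characters" with the
# actual set reported by find_unique_chars.py for your data.
characters = CharactersConfig(
    characters_class="TTS.tts.models.vits.VitsCharacters",
    characters="абвгдеёжзийклмнопрстуфхцчшщъыьэюя",
    punctuations="!,.? -",
    pad="<PAD>",
    eos="<EOS>",
    bos="<BOS>",
    blank="<BLNK>",
)
```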
add_blank=False
I set this to get rid of some "character not found, skipping" warnings in the logs; those messages were filled with lots of <BLNK> symbols, which was super annoying. I have heard this can affect quality, and I want to turn it back on once I fix the characters.
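As far as I understand it, add_blank interleaves a blank token between character IDs before the text encoder. A conceptual sketch (not the actual Coqui implementation):

```python
# Interleave a <BLNK> id between every character id, as add_blank does conceptually.
ids = [5, 12, 7]   # hypothetical character ids
blank_id = 0
interleaved = [blank_id]
for i in ids:
    interleaved += [i, blank_id]
print(interleaved)  # [0, 5, 0, 12, 0, 7, 0]
```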
I am also attaching my full config: config.txt
Thank you for reading. Like I said at the beginning, any tip is very welcome.