How many hours of audio are needed to train a model? #2
Comments
If you want to train from scratch, I think you need a very large dataset. If you want to fine-tune (note that the language of your dataset must be among the XTTS pretraining languages), I think at least 30 minutes is enough.
Remember that we use a GAN loss, so the discriminator loss and generator loss are optimized in opposite directions. The best model is chosen based on the evaluation test loss, so pick your checkpoint from that.
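(For readers unfamiliar with this: a minimal sketch of why the two losses move in opposite directions, assuming standard LSGAN-style objectives; this is illustrative, not this repo's exact loss code.)

```python
import torch
import torch.nn.functional as F

# Illustrative LSGAN-style objectives: improving the generator pushes the
# discriminator's loss up, and vice versa, so neither loss is expected to
# decrease monotonically during GAN training.
def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Push scores on real audio toward 1 and on generated audio toward 0.
    return F.mse_loss(d_real, torch.ones_like(d_real)) + \
           F.mse_loss(d_fake, torch.zeros_like(d_fake))

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Push the discriminator's score on generated audio toward 1,
    # i.e. directly oppose the discriminator's objective above.
    return F.mse_loss(d_fake, torch.ones_like(d_fake))
```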
Hi, what about multiple speakers? Is 30 minutes enough?
I'm trying to train from scratch; I'll let you know when I succeed.
Thanks. I used 2,000 samples, and it works after fine-tuning (but the quality isn't very high).
Is this for multiple speakers? Could you provide examples of the generated voices? My XTTS-v2 model produces a crackling sound when generating higher-pitched female voices after fine-tuning. I therefore attempted to fine-tune HiFi-GAN using a dataset of approximately 10,000 multi-speaker samples (the same dataset used to fine-tune the XTTS-v2 model), but it was unsuccessful.
Are you training on a new language? I added Vietnamese to the GPT stage and used that same dataset for HiFi-GAN, and it works.
More data will improve voice quality.
I fine-tuned on Chinese data. GT audio: Test audio:
I found that the audio files generated in the "synthesis" folder are already unintelligible; the sentence content cannot be recognized. Is this normal?
If you use the pretrained model, the generated audio should match the sentence content.
Whether I use the pre-trained or the fine-tuned XTTS model, the differences between the audio in the 'wav' and 'synthesis' folders are larger on the Chinese dataset than on the English one, to the point that the content becomes unintelligible.
This doesn't happen with Vietnamese; the synthesized and raw wavs are almost the same. Can you send me some synthesis files and the corresponding raw audio?
Raw wav: Synthesis:
I don't know Chinese, so tell me: does the synthesized audio read the sentence correctly?
I feel that the model is indeed trying to synthesize the same sentences, but the pronunciation is not standard, making the content unrecognizable.
So I think the problem is in the GPT part of your XTTS model, because with my language the pronunciation in the synthesized audio is correct.
I used the pre-trained model and set the generator like this:

```python
self.hifigan_generator = self.load_hifigan_generator()
# ...
model.hifigan_decoder.waveform_decoder = self.hifigan_generator
```

This ensures that XTTS uses the pre-trained HiFi-GAN. I ran both `test.py` and `generate_latents.py`; in theory, both should produce the same output. However, I noticed that the outputs differ. Is my approach to this test correct?
No. `test.py` uses the inference function of the GPT part, while `generate_latents.py` uses forward, so the outputs are not the same.
Could you explain in detail the difference between `inference` and `forward`?
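(Roughly, a minimal sketch of the general distinction under the usual autoregressive-TTS assumptions; the model and attribute names here are hypothetical, not the actual XTTS API. `forward` is teacher-forced on the ground-truth codes, while `inference` feeds the model its own previous predictions, so the two can diverge.)

```python
import torch

def teacher_forced_forward(model, text, gt_codes):
    # forward(): every decoding step is conditioned on the ground-truth
    # previous codes, so the resulting latents closely track the real audio.
    return model(text, gt_codes)

def autoregressive_inference(model, text, max_len=600):
    # inference(): each step is conditioned on the model's own earlier
    # predictions, so small errors compound and the output can drift from
    # what a teacher-forced pass over the same text would produce.
    codes = [model.bos_id]
    for _ in range(max_len):
        logits = model(text, torch.tensor([codes]))
        next_code = int(logits[0, -1].argmax())
        if next_code == model.eos_id:
            break
        codes.append(next_code)
    return codes
```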
Maybe you could check whether you are actually using the pretrained HiFi-GAN weights.
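(One quick way to verify that, as a sketch: the checkpoint path and key layout below are assumptions, and `model` is the loaded XTTS model from the snippet above, so adapt this to your files.)

```python
import torch

# Compare the decoder weights the model is actually holding against the
# pretrained HiFi-GAN checkpoint on disk. Any mismatch means the
# pretrained weights were not really loaded.
pretrained = torch.load("hifigan_pretrained.pth", map_location="cpu")
current = model.hifigan_decoder.waveform_decoder.state_dict()

for name, tensor in current.items():
    if name in pretrained and not torch.equal(tensor.cpu(), pretrained[name]):
        print(f"mismatch in {name}")
```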
May I ask whether it takes many epochs to get clear-sounding audio, even when fine-tuning?
Doesn't the XTTS training process have to run sequentially through the following pipeline, with the output of the previous stage used as the input of the next:
1. Fine-tune the DVAE model
2. Fine-tune the GPT model
3. Train the HiFi-GAN vocoder
Am I right?
Hello @tuanh123789,
LJSpeech is an English dataset. You can certainly train an XTTS model for Vietnamese, provided you have an XTTS GPT trained on Vietnamese that can be used to generate the training data for the XTTS HiFi-GAN.
Could you elaborate on training the XTTS GPT? Thanks.
This model is trained in 3 stages: VAE, GPT, HiFi-GAN. This repo is for training the HiFi-GAN for XTTS. To train for Vietnamese, you first need to train stage 2, the GPT, on Vietnamese. I haven't published that script yet.
I would also love to know the answer to this :)
The HiFi-GAN is a separate part, not end-to-end. The output from the GPT part is used to train the HiFi-GAN. Fine-tuning the DVAE is not necessary.
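(As a rough sketch of that data flow; the function names and dataset layout are assumptions, not this repo's exact scripts.)

```python
import torch

def build_vocoder_dataset(gpt_model, dataset):
    # Stage 2 output feeds stage 3: run the (fine-tuned) GPT part in
    # teacher-forced mode to dump a latent for each utterance; HiFi-GAN
    # is then trained to map those latents back to the raw waveform.
    pairs = []
    gpt_model.eval()
    with torch.no_grad():
        for tokens, mel, waveform in dataset:  # hypothetical dataset layout
            latent = gpt_model(tokens, mel)    # teacher-forced forward pass
            pairs.append((latent.cpu(), waveform))
    return pairs
```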
Additionally, I want to ask why some losses increase during training. Should I choose the last checkpoint or the best checkpoint?