How many hours of audio are needed to train a model? #2
Comments
If you want to train from scratch, I think you need a very large dataset. If you want to fine-tune (note that the language of your dataset must be among the XTTS pretraining languages), I think at least 30 minutes is enough.
Remember that we use a GAN loss, so the discriminator loss and generator loss are optimized in opposite directions. The best model is chosen based on the evaluation test loss, so pick your checkpoint from that.
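(For readers unfamiliar with this: a minimal sketch of why the two losses move in opposite directions, assuming standard LSGAN-style objectives; this is illustrative, not this repo's exact loss code.)

```python
import torch
import torch.nn.functional as F

# Illustrative LSGAN-style objectives: improving the generator pushes the
# discriminator's loss up, and vice versa, so neither loss is expected to
# decrease monotonically during GAN training.
def discriminator_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # Push scores on real audio toward 1 and on generated audio toward 0.
    return F.mse_loss(d_real, torch.ones_like(d_real)) + \
           F.mse_loss(d_fake, torch.zeros_like(d_fake))

def generator_loss(d_fake: torch.Tensor) -> torch.Tensor:
    # Push the discriminator's score on generated audio toward 1,
    # i.e. directly oppose the discriminator's objective above.
    return F.mse_loss(d_fake, torch.ones_like(d_fake))
```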
Hi, what about multiple speakers? Is 30 minutes enough?
I'm trying to train from scratch; I'll let you know when I succeed.
Thanks. I used 2,000 samples, and it works after fine-tuning (but the quality isn't very high).
Is this for multiple speakers? Could you provide examples of the generated voices? My XTTS-v2 model produces a crackling sound when generating higher-pitched female voices after fine-tuning. I therefore attempted to fine-tune HiFi-GAN using a dataset of approximately 10,000 multi-speaker samples (the same dataset used to fine-tune the XTTS-v2 model), but it was unsuccessful.
Are you training on a new language? I added Vietnamese to the GPT stage and used that same dataset for HiFi-GAN, and it works.
More data will improve voice quality.
I fine-tuned on Chinese data. GT audio: Test audio:
I found that the audio files generated in the "synthesis" folder are already unintelligible; the sentence content cannot be recognized. Is this normal?
If you use the pretrained model, the generated audio should match the sentence content.
Whether I use the pre-trained or the fine-tuned XTTS model, the differences between the audio in the 'wav' and 'synthesis' folders are larger on the Chinese dataset than on the English one, to the point that the content becomes unintelligible.
This doesn't happen with Vietnamese; the synthesized and raw wavs are almost the same. Can you send me some synthesis files and the corresponding raw audio?
Raw wav: Synthesis:
I don't know Chinese, so tell me: does the synthesized audio read the sentence correctly?
I feel that the model is indeed trying to synthesize the same sentences, but the pronunciation is not standard, making the content unrecognizable.
So I think the problem is in the GPT part of your XTTS model, because with my language the pronunciation in the synthesized audio is correct.
I used the pre-trained model and set the generator like this:

```python
self.hifigan_generator = self.load_hifigan_generator()
# ...
model.hifigan_decoder.waveform_decoder = self.hifigan_generator
```

This ensures that XTTS uses the pre-trained HiFi-GAN. I ran both `test.py` and `generate_latents.py`; in theory, both should produce the same output. However, I noticed that the outputs differ. Is my approach to this test correct?
No. `test.py` uses the inference function of the GPT part, while `generate_latents.py` uses forward, so the outputs are not the same.
Could you explain in detail the difference between `inference` and `forward`?
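(Roughly, a minimal sketch of the general distinction under the usual autoregressive-TTS assumptions; the model and attribute names here are hypothetical, not the actual XTTS API. `forward` is teacher-forced on the ground-truth codes, while `inference` feeds the model its own previous predictions, so the two can diverge.)

```python
import torch

def teacher_forced_forward(model, text, gt_codes):
    # forward(): every decoding step is conditioned on the ground-truth
    # previous codes, so the resulting latents closely track the real audio.
    return model(text, gt_codes)

def autoregressive_inference(model, text, max_len=600):
    # inference(): each step is conditioned on the model's own earlier
    # predictions, so small errors compound and the output can drift from
    # what a teacher-forced pass over the same text would produce.
    codes = [model.bos_id]
    for _ in range(max_len):
        logits = model(text, torch.tensor([codes]))
        next_code = int(logits[0, -1].argmax())
        if next_code == model.eos_id:
            break
        codes.append(next_code)
    return codes
```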
Maybe you could check whether you are actually using the pretrained HiFi-GAN weights.
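(One quick way to verify that, as a sketch: the checkpoint path and key layout below are assumptions, and `model` is the loaded XTTS model from the snippet above, so adapt this to your files.)

```python
import torch

# Compare the decoder weights the model is actually holding against the
# pretrained HiFi-GAN checkpoint on disk. Any mismatch means the
# pretrained weights were not really loaded.
pretrained = torch.load("hifigan_pretrained.pth", map_location="cpu")
current = model.hifigan_decoder.waveform_decoder.state_dict()

for name, tensor in current.items():
    if name in pretrained and not torch.equal(tensor.cpu(), pretrained[name]):
        print(f"mismatch in {name}")
```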
May I ask whether it takes many epochs to get clear-sounding audio, even when fine-tuning?
Doesn't the XTTS training process have to run sequentially through the following pipeline, with the output of the previous stage used as the input of the next:
1. Fine-tune the DVAE model
2. Fine-tune the GPT model
3. Train the HiFi-GAN vocoder
Am I right?
Hello @tuanh123789,
LJSpeech is an English dataset. You can certainly train an XTTS model for Vietnamese, provided you have an XTTS GPT trained on Vietnamese that can be used to generate the training data for the XTTS HiFi-GAN.
Could you elaborate on training the XTTS GPT? Thanks.
This model is trained in 3 stages: VAE, GPT, HiFi-GAN. This repo is for training the HiFi-GAN for XTTS. To train for Vietnamese, you first need to train stage 2, the GPT, on Vietnamese. I haven't published that script yet.
I would also love to know the answer to this :)
The HiFi-GAN is a separate part, not end-to-end. The output from the GPT part is used to train the HiFi-GAN. Fine-tuning the DVAE is not necessary.
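(As a rough sketch of that data flow; the function names and dataset layout are assumptions, not this repo's exact scripts.)

```python
import torch

def build_vocoder_dataset(gpt_model, dataset):
    # Stage 2 output feeds stage 3: run the (fine-tuned) GPT part in
    # teacher-forced mode to dump a latent for each utterance; HiFi-GAN
    # is then trained to map those latents back to the raw waveform.
    pairs = []
    gpt_model.eval()
    with torch.no_grad():
        for tokens, mel, waveform in dataset:  # hypothetical dataset layout
            latent = gpt_model(tokens, mel)    # teacher-forced forward pass
            pairs.append((latent.cpu(), waveform))
    return pairs
```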
Additionally, I want to ask why some losses increase during training. Should I choose the last checkpoint or the best checkpoint?