I trained several TTS models here (Tacotron2-DDC, GlowTTS) and HiFi-GAN, but all of them generate noise without speech #2397
ahmedalbahnasawi started this conversation in General
I'm dealing with a very clean non-English dataset. I mapped the text to phonemes using my G2P model and also managed to write my own formatter. Samples from my metadata.csv file:
```
001_003_001|fT TIWVDUC T TIWFVC|fT TIWVDUC T TIWFVC
001_004_001|VGSCeC RDQVC b bFUC|VGSCeC RDQVC b bFUC
```
Each letter represents a phoneme in my characters list, which contains upper- and lower-case English characters, and space is the word separator. I was able to get nice audios from https://github.com/TensorSpeech/TensorFlowTTS but struggled to generate audio files longer than 11 seconds.
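For reference, this is roughly the shape of the formatter I wrote (the function name, the `wavs/` folder, and the speaker name are placeholders; the returned dict keys follow the convention of the built-in formatters in `TTS.tts.datasets.formatters`):

```python
import os

# Minimal sketch of a custom formatter for the metadata.csv layout above.
# Each line is "<utt_id>|<raw text>|<phonemized text>".
def my_formatter(root_path, meta_file, **kwargs):
    items = []
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        for line in f:
            cols = line.strip().split("|")
            wav_file = os.path.join(root_path, "wavs", cols[0] + ".wav")  # placeholder wav layout
            text = cols[2]  # use the phonemized column as the training text
            items.append(
                {"text": text, "audio_file": wav_file, "speaker_name": "speaker_0", "root_path": root_path}
            )
    return items
```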
Many people recommend Tacotron2-DDC, which can naturally align an input of 1 minute and 49 seconds.
When I train the model, it is not able to generate anything, only a loud electric-noise wav.
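For context, this is roughly how I enable the DDC variant (field names assumed from the `Tacotron2Config` dataclass in `TTS.tts.configs.tacotron2_config`; all values are placeholders):

```python
from TTS.tts.configs.tacotron2_config import Tacotron2Config

# Sketch of the Tacotron2-DDC part of my training config.
config = Tacotron2Config(
    double_decoder_consistency=True,  # enables the DDC coarse decoder
    ddc_r=6,                          # reduction factor of the coarse decoder
    r=2,                              # reduction factor of the fine decoder
    use_phonemes=False,               # text is already phonemized by my G2P model
    text_cleaner="basic_cleaners",
    output_path="output/",            # placeholder
)
```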
During inference I debugged the workflow and found that all mel-spectrogram values are negative.
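For comparison, a quick sanity check on a ground-truth mel extracted with librosa (the wav path and STFT parameters are placeholders) also gives an all-negative range, since the dB conversion is taken relative to the peak:

```python
import librosa
import numpy as np

# Extract a ground-truth mel spectrogram and print its value range.
y, sr = librosa.load("wavs/001_003_001.wav", sr=22050)           # placeholder path
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)  # placeholder STFT params
mel_db = librosa.power_to_db(mel, ref=np.max)                    # dB relative to the peak => values <= 0
print(mel_db.min(), mel_db.max())
```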
I'm pretty sure there is nothing wrong with the HiFi-GAN training, as it does not deal with text.
Do you think I have to separate each letter with a space, use '-' as the word separator, and append the 'eos' token '#' to every sentence? For example (a sketch of the corresponding character config follows the example):
```
001_003_001|f T - T I W V D U C - T - T I W F V C|f T - T I W V D U C - T - T I W F V C #
001_004_001|V G S C e C - R D Q V C - b - b F U C|V G S C e C - R D Q V C - b - b F U C #
```
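This is a sketch of the character config I would pair with that layout (assuming the `CharactersConfig` dataclass from `TTS.tts.configs.shared_configs`; the pad/bos symbols are placeholders):

```python
from TTS.tts.configs.shared_configs import CharactersConfig

# Sketch of a character set for the space-separated layout above.
characters_config = CharactersConfig(
    pad="<PAD>",  # placeholder padding symbol
    bos="<BOS>",  # placeholder begin-of-sentence symbol
    eos="#",      # '#' appended to every sentence as the eos token
    characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz",  # my phoneme symbols
    punctuations="-",  # '-' acting as the word separator
)
```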
Many Thanks