Question about extracting phonemes (and their durations) during XTTS inference #3721
Unanswered
Tetsuo-tek asked this question in General Q&A
Replies: 1 comment
XTTS doesn't use phonemes, so you'd have to derive such data yourself.
-
Hi. I'm using XTTS_V2 in an application for a speaking assistant with a 3D avatar. Previously I used libespeak to generate the voice, and from it I also retrieved the phonemes and their durations in milliseconds. Those phonemes let me move the avatar's mouth by generating a lip movement for each one. The voice libespeak produces is very robotic and unnatural, so after trying XTTS I decided to switch to it. The only problem I still need to solve to integrate XTTS into my project is capturing the generated phonemes together with their durations and their positions inside the generated PCM. I have been reading the XTTS code, but I can't tell whether it is possible to obtain the phonemes and their durations.
The question: is there an efficient way to obtain this information directly, rather than reprocessing the generated audio to extract it?
Thank you.
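To make the target data concrete, here is a minimal Python sketch of the timeline described above: a list of (phoneme, duration in milliseconds) pairs turned into start and end positions inside the PCM buffer. It does not call XTTS or libespeak; the names `PhonemeSpan` and `build_timeline` are hypothetical, and the 24 kHz sample rate is an assumption that should be checked against the actual model output. Where the durations come from (an external aligner, for example) is left open.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed output sample rate; verify it against your model configuration.
SAMPLE_RATE = 24_000


@dataclass
class PhonemeSpan:
    """One phoneme with its position in time and in the PCM buffer."""
    phoneme: str
    start_ms: float
    end_ms: float
    start_sample: int
    end_sample: int


def build_timeline(
    phonemes: List[Tuple[str, float]],
    sample_rate: int = SAMPLE_RATE,
) -> List[PhonemeSpan]:
    """Accumulate per-phoneme durations (ms) into absolute PCM offsets."""
    timeline: List[PhonemeSpan] = []
    cursor_ms = 0.0
    for phoneme, duration_ms in phonemes:
        end_ms = cursor_ms + duration_ms
        timeline.append(
            PhonemeSpan(
                phoneme=phoneme,
                start_ms=cursor_ms,
                end_ms=end_ms,
                # Convert milliseconds to sample indices.
                start_sample=round(cursor_ms * sample_rate / 1000),
                end_sample=round(end_ms * sample_rate / 1000),
            )
        )
        cursor_ms = end_ms
    return timeline


if __name__ == "__main__":
    # Hypothetical input: phonemes with durations in milliseconds.
    for span in build_timeline([("h", 50.0), ("ə", 100.0), ("l", 80.0)]):
        print(span)
```

Each span can then drive one lip movement: the avatar switches mouth shape when audio playback reaches `start_sample`.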