Question about extracting phonemes (and their durations) during XTTS inference #3721

Tetsuo-tek · 2024-05-06T07:59:17Z

Hi. I'm using XTTS_V2 in an application for an assistant with a 3D avatar that can speak. Previously I used libespeak to generate the voice, and from it I also retrieved the phonemes and their durations in milliseconds. The retrieved phonemes let me animate the avatar's mouth by generating a lip movement for each phoneme. As you know, a voice generated this way sounds very robotic and unnatural, so I decided to switch to XTTS after trying it out. The only problem I need to solve to integrate XTTS into my project is capturing the generated phonemes along with their durations and positions within the generated PCM. I have been reading through the XTTS code, but I can't tell whether it is possible to obtain the phonemes and their durations.

The question: is there a way to efficiently obtain this information instead of reprocessing the produced audio to get it?
Thank you.

eginhard · 2024-05-06T11:51:54Z

XTTS doesn't use phonemes, so you'd have to derive such data yourself.
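
As a rough illustration of deriving lip-sync data from the audio itself: the sketch below does not recover phonemes at all. It only computes a per-frame loudness envelope from the generated PCM, which could drive how far the avatar's mouth opens. The function name `mouth_openness`, the 20 ms frame size, and the 24 kHz default sample rate are assumptions made for this example, not part of the XTTS API. Real phoneme timings would still need a separate step, such as aligning the input text against the audio.

```python
import math


def mouth_openness(samples, sample_rate=24000, frame_ms=20):
    """Return (start_ms, openness) pairs for mono float PCM samples.

    openness is the frame's RMS amplitude scaled to [0, 1] by the loudest
    frame. sample_rate=24000 is an assumption; use the rate of your audio.
    """
    # Number of samples per analysis frame (at least one).
    frame_len = max(1, sample_rate * frame_ms // 1000)

    starts = []
    rms_values = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        starts.append(start * 1000 / sample_rate)  # frame start in ms
        rms_values.append(rms)

    # Normalize by the loudest frame; silent input maps to 0.0 everywhere.
    peak = max(rms_values, default=0.0)
    if peak == 0.0:
        return [(t, 0.0) for t in starts]
    return [(t, r / peak) for t, r in zip(starts, rms_values)]
```

Each returned pair gives a position in the audio (in milliseconds) and a value that could be mapped to a jaw or mouth blend shape. This is much cruder than the per-phoneme shapes libespeak made possible, so it is best treated as a fallback.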
