[Feature request] [TTS] Support SSML in input text #752
Comments
Seconded! SSML would be a very valuable addition to the TTS. It would be especially useful for controlling pauses, line breaks, emotion (if possible, using heightened pitch), and urgency (by increasing the speed of spoken text to 1.5x). It would also be useful in multi-speaker models, where we would give the speaker ID in the SSML itself and Coqui would string them together, though this would be a stretch goal beyond the basic SSML implementation. Please let us know how we can help with this. Is there an SSML implementation you need us to research? (Just as gruut was integrated, perhaps we can integrate an existing SSML framework as well.) Is there some code we can contribute? |
Which SSML tags/properties do you think would be the most valuable to implement? |
Well, here are the tags that would be most relevant according to me (in some order of relevance) -
There are a lot of implementations and features that are non-standard, but can be very useful, such as Sources - By the way, I have a question about how SSML is implemented in neural TTS - I do not understand how the SSML tags would be translated to the voice. Would we need to train models which have different pitch, pauses, and volumes? Would we need to train models that know how to pronounce certain words we ask them to spell out (like USA, AWS, ISIS etc)? Could you help me understand how this would be implemented? |
@nitinthewiz thx for the great post. All the use-cases make sense; however, implementing SSML requires a lot of effort. I think we can start by implementing some of the basic functionalities and expand them as we go. I don't know when we can start implementing SSML, but I've added it to our task list here #378. When it comes to your question, some basic manipulations (speed, volume, etc.) are straightforward to implement with a single model. However, some need model-level architectural changes or improvements, as you noted: emotions, pitch, and so on. |
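To illustrate why volume and pauses are considered the easy cases, here is a minimal sketch that treats them as pure post-processing on the synthesized waveform, so no model change is involved. The function names and the list-of-floats waveform are assumptions made for illustration; none of this is 🐸 TTS code, and speed changes are not covered.

```python
from typing import List


def apply_volume(samples: List[float], gain: float) -> List[float]:
    """Scale every sample by `gain` and clip the result to [-1.0, 1.0]."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def silence(duration_ms: int, sample_rate: int) -> List[float]:
    """Return `duration_ms` milliseconds of silence at `sample_rate` Hz."""
    return [0.0] * (sample_rate * duration_ms // 1000)


def join_with_pause(
    first: List[float], second: List[float], pause_ms: int, sample_rate: int
) -> List[float]:
    """Concatenate two synthesized chunks with a pause in between."""
    return first + silence(pause_ms, sample_rate) + second
```

A pause element in the markup would then amount to synthesizing the text on either side separately and joining the results with `join_with_pause`.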
@erogol I would be interested in starting on this. Some tags can be handled by gruut, such as It may be worth (me) implementing support for PLS lexicons as well, so users could expand gruut's vocabulary. |
@erogol thanks a lot for following up and for the explanation! @synesthesiam let me know how I can help with the lexicon, or once you've implemented it, we can start contributing to the vocab. |
Small update: I've got preliminary SSML functionality in a side branch of gruut now with support for:
Numbers, dates, currency, and initialisms are automatically detected and verbalized. I've gone the extra mile and made full use of the
<speak>
  <w lang="en_US">1</w>
  <w lang="es_ES">1</w>
</speak>
verbalized as "one uno". This works for dates, etc. as well, and can even generate phonemes from different languages in the same document. I imagine this could be used in 🐸 TTS with The biggest thing that would help me in completing this feature is deciding on the default settings for non-English languages:
|
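The per-word language switching described in the comment above can be sketched roughly as follows. This is not gruut's API: the `verbalize` function and the toy `NUMBER_WORDS` table are hypothetical stand-ins, and the `lang` attribute is taken from the example in the comment.

```python
import xml.etree.ElementTree as ET

# Toy lookup table (assumption): a real system would use a per-language
# number verbalizer rather than a hard-coded dictionary.
NUMBER_WORDS = {
    "en_US": {"1": "one"},
    "es_ES": {"1": "uno"},
}


def verbalize(ssml: str, default_lang: str = "en_US") -> str:
    """Verbalize each <w> element using the language named in its `lang` attribute."""
    root = ET.fromstring(ssml)
    words = []
    for w in root.iter("w"):
        lang = w.get("lang", default_lang)
        text = (w.text or "").strip()
        # Fall back to the raw text when no verbalization is known.
        words.append(NUMBER_WORDS.get(lang, {}).get(text, text))
    return " ".join(words)


ssml = '<speak><w lang="en_US">1</w><w lang="es_ES">1</w></speak>'
print(verbalize(ssml))
```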
I think default formats need to be handled by the text normalizer in a way that the model can read. Is this what you also mean? @synesthesiam ? |
Yes, and also the normalization needs to mirror what the speaker likely did when reading the text. So when gruut comes across "4/1/2021" in text, it needs to come out as the most likely verbalization in the given language/locale. For U.S. English, "4/1/2021" becomes "April first twenty twenty one". For German, it is "Januar vierte zweitausendeinundzwanzig" instead, which I'm hoping is the right thing to do. Regarding punctuation, I know that dashes and underscores (and even camelCasing) can be used to break apart English words for the purpose of phonemization -- "ninety" and "nine" are likely in the lexicon, but "ninety-nine" may not be. But this gets more complicated in French: "est-que" is present in the lexicon and is not the same as the phonemes("est") + phonemes("que"). So what I'm doing now is checking the lexicon first, and only breaking words if they're not present. |
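The lexicon-first rule from the comment above could look something like this sketch. The `split_word` helper and the set-based lexicon are assumptions for illustration and do not reflect how gruut stores or queries its lexicons.

```python
import re
from typing import List, Set

# Break on dashes, underscores, and lower-to-upper camelCase boundaries.
# Splitting on the zero-width camelCase boundary requires Python 3.7+.
_BREAK = re.compile(r"[-_]|(?<=[a-z])(?=[A-Z])")


def split_word(word: str, lexicon: Set[str]) -> List[str]:
    """Keep the word whole if the lexicon knows it; otherwise break it apart."""
    if word.lower() in lexicon:
        return [word]
    parts = [p for p in _BREAK.split(word) if p]
    return parts or [word]


english = {"ninety", "nine"}
french = {"est-que", "est", "que"}

print(split_word("ninety-nine", english))  # not in the lexicon, so it is split
print(split_word("est-que", french))       # in the lexicon, so it stays whole
```

Checking the lexicon before splitting is what keeps the French entry intact, since its pronunciation is not the concatenation of its parts.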
@erogol It might be worth moving this to a discussion. I've completed my first prototype of 🐸 TTS with SSML support (currently here)! I'm using a gruut side branch for now (supported SSML tags). Now something like this works:
SSML=$(cat << EOF
<speak>
<s lang="en">123</s>
<s lang="de">123</s>
<s lang="es">123</s>
<s lang="fr">123</s>
<s lang="nl">123</s>
</speak>
EOF
)
python3 TTS/bin/synthesize.py \
--model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
--extra_model_name tts_models/de/thorsten/tacotron2-DCA \
--extra_model_name tts_models/es/mai/tacotron2-DDC \
--extra_model_name tts_models/fr/mai/tacotron2-DDC \
--extra_model_name tts_models/nl/mai/tacotron2-DDC \
--text "$SSML" --ssml true --out_path ssml.wav
Which outputs a WAV file with:
Before getting any deeper, I wanted to see if I'm on the right track. The three main changes I've made are:
Synthesizer
I created a The
If no voice or language is specified, the default voice is used.
Command-Line
The
python3 TTS/server/server.py \
    --model_name tts_models/en/ljspeech/tacotron2-DDC_ph \
    --extra_model_name tts_models/de/thorsten/tacotron2-DCA \
    --extra_model_name tts_models/en/vctk/vits
The default voice is specified as normal (with Additionally, the
Web UI
The two web UI changes are:
|
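The fallback rule mentioned in the comment above (use the default voice when no voice or language is specified) might be sketched like this. `Voice` and `VoiceRegistry` are hypothetical names, not classes from the prototype, and the sketch assumes model names follow the `tts_models/<language>/<dataset>/<model>` pattern seen in the commands.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Voice:
    model_name: str

    @property
    def language(self) -> str:
        # Assumes names look like "tts_models/<language>/<dataset>/<model>".
        return self.model_name.split("/")[1]


class VoiceRegistry:
    """Holds the default voice plus any extra voices loaded alongside it."""

    def __init__(self, default: Voice) -> None:
        self.default = default
        self._voices: Dict[str, Voice] = {default.model_name: default}

    def add(self, voice: Voice) -> None:
        self._voices[voice.model_name] = voice

    def resolve(self, name: Optional[str] = None, language: Optional[str] = None) -> Voice:
        """Prefer an exact voice name, then a language match, then the default."""
        if name in self._voices:
            return self._voices[name]
        if language is not None:
            for voice in self._voices.values():
                if voice.language == language:
                    return voice
        return self.default


registry = VoiceRegistry(Voice("tts_models/en/ljspeech/tacotron2-DDC_ph"))
registry.add(Voice("tts_models/de/thorsten/tacotron2-DCA"))
registry.add(Voice("tts_models/en/vctk/vits"))
```

Falling back to the default for an unknown name or language is a design assumption here; a stricter variant could raise an error instead.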
@synesthesiam it is a great start to SSML!!! I think we should also decide how we want to land SSML to the library architecture. Before saying anything, I'd be interested in hearing your opinions about that. |
The biggest change so far is SSML being able to reference multiple voices and languages. In the future, Architecturally with SSML, text processing, model loading, and synthesis are all tied together at roughly the same abstraction level. Some important questions are:
Is the user (or code) required to pre-load all relevant models, or could it happen dynamically? If dynamic, is the user able to specify a custom model? Perhaps the With the proper use of a
Text processing with Phonemization is no longer an independent stage either, since the gruut's Maybe 🐸 TTS could plug user-defined functions into this pipeline? They don't have to operate on the whole graph, many of mine just work on a single word at a time. For example, this code converts numbers into words for any language supported by Depending on where you are in the pipeline, user-defined functions could also operate specifically on numbers, dates, currency, etc. I have code, for instance, that verbalizes numbers as years similar to your code, but done in a (mostly) language-independent manner. I'll stop before this gets any more long-winded and see what your thoughts are 🙂 |
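One way to picture the word-at-a-time user-defined functions suggested above is a pipeline in which each step maps one word to zero or more words. This is only a sketch: `run_pipeline` and the two example steps are hypothetical, and gruut's actual pipeline operates on a richer graph structure.

```python
from typing import Callable, Iterable, List

# A step receives one word and returns the words that replace it.
WordStep = Callable[[str], List[str]]


def run_pipeline(words: Iterable[str], steps: Iterable[WordStep]) -> List[str]:
    """Apply each step to every word in order, flattening the results."""
    current = list(words)
    for step in steps:
        current = [out for word in current for out in step(word)]
    return current


def expand_ampersand(word: str) -> List[str]:
    """Example user-defined step: verbalize '&' as 'and'."""
    return ["and"] if word == "&" else [word]


def lowercase(word: str) -> List[str]:
    """Example user-defined step: normalize case."""
    return [word.lower()]


print(run_pipeline(["Salt", "&", "Pepper"], [expand_ampersand, lowercase]))
```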
I think users should define not only the language but also the model name and we can load models dynamically. Something like Threading would be a nice perf improvement too.
I think before we go and solve SSML we need to write up a My understanding of But also the SSMLParser should know what options are available for the chosen model since different models support different sets of SSML tags. |
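The idea of naming the model in the markup and loading it dynamically could be sketched as below. The `<voice name="...">` element follows standard SSML, but using a model name as its value is an assumption, as are the `segments` and `load_model` helpers; `load_model` is a placeholder that only records what would be loaded.

```python
import xml.etree.ElementTree as ET
from functools import lru_cache
from typing import List, Tuple

LOADED: List[str] = []  # records actual loads so the caching is observable


@lru_cache(maxsize=None)
def load_model(model_name: str) -> dict:
    """Placeholder loader (assumption): each model is loaded at most once."""
    LOADED.append(model_name)
    return {"name": model_name}


def segments(ssml: str, default_model: str) -> List[Tuple[str, str]]:
    """Split top-level SSML content into (model_name, text) pairs."""
    root = ET.fromstring(ssml)
    result: List[Tuple[str, str]] = []
    if root.text and root.text.strip():
        result.append((default_model, root.text.strip()))
    for child in root:
        model = child.get("name", default_model) if child.tag == "voice" else default_model
        text = "".join(child.itertext()).strip()
        if text:
            result.append((model, text))
        if child.tail and child.tail.strip():
            result.append((default_model, child.tail.strip()))
    return result


DEFAULT = "tts_models/en/ljspeech/tacotron2-DDC_ph"
GERMAN = "tts_models/de/thorsten/tacotron2-DCA"
doc = f'<speak>Hello <voice name="{GERMAN}">Hallo</voice> again <voice name="{GERMAN}">Welt</voice></speak>'

for model_name, _text in segments(doc, DEFAULT):
    load_model(model_name)  # repeated names hit the cache
```

Because the loader is cached, a model referenced many times in one document is only loaded on first use, which is the behavior dynamic loading would need to stay cheap.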
A default model for each dataset may be worth it too. So, "ljspeech" could default to whatever the best sounding model is currently.
This sounds reasonable, though I suspect over time that the [1] For example, is "1234" a cardinal number, ordinal number, year, or digits? The
Another approach is to parse everything into the metadata and leave it up to the |
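The ambiguity raised in the footnote above, where "1234" can be read several ways, is easy to see in a small sketch. The `interpret_as` parameter mirrors the SSML `say-as` idea, but the function names and the English-only rules are assumptions; ordinals are left out, and years that do not split cleanly into two pairs fall back to the cardinal reading.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def cardinal(n: int) -> str:
    """Verbalize 0..9999 as an English cardinal number (no hyphens, no 'and')."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + ("" if rest == 0 else " " + cardinal(rest))
    thousands, rest = divmod(n, 1000)
    return ONES[thousands] + " thousand" + ("" if rest == 0 else " " + cardinal(rest))


def verbalize(text: str, interpret_as: str = "cardinal") -> str:
    """Verbalize a digit string according to the requested interpretation."""
    if interpret_as == "digits":
        return " ".join(ONES[int(ch)] for ch in text)
    number = int(text)
    if interpret_as == "year":
        high, low = divmod(number, 100)
        if 10 <= high <= 99 and low >= 10:
            return f"{cardinal(high)} {cardinal(low)}"
    # Cardinal reading, also used as the fallback for unhandled cases.
    return cardinal(number)


for mode in ("cardinal", "year", "digits"):
    print(mode, "->", verbalize("1234", mode))
```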
When I say Tokenizer, I mean something that the model can also use in training. So as there is no use for SSML in training, it makes sense to use the Tokenizer as the base class, I guess. But I mostly agree with you for inference. Tokenizer can have preprocess, tokenize, and postprocess steps, and we can deal with the contextual information in the preprocess step by providing the right set of preprocessing steps for the selected language. I don't like the "ignoring" idea, since then the user does not really know what works and what doesn't. To me, defining the available tags based on the selected models makes more sense. But it is also definitely harder than just ignoring. Maybe we should start by ignoring for simplicity. |
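A rough sketch of the three-step Tokenizer described above, with language-specific preprocessing supplied as a list of functions. This is not the TokenizerAPI from #1079; the class shape, the character-level tokenization, and the choice to drop out-of-vocabulary symbols are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence


class Tokenizer:
    """Preprocess text, split it into tokens, then map tokens to IDs."""

    def __init__(
        self,
        vocab: Dict[str, int],
        preprocessors: Sequence[Callable[[str], str]] = (),
    ) -> None:
        self.vocab = vocab
        self.preprocessors = list(preprocessors)

    def preprocess(self, text: str) -> str:
        # Language-specific normalization steps run in the order given.
        for fn in self.preprocessors:
            text = fn(text)
        return text

    def tokenize(self, text: str) -> List[str]:
        # Character-level tokens; a phoneme-based model would differ here.
        return list(text)

    def postprocess(self, tokens: List[str]) -> List[int]:
        # Symbols missing from the vocabulary are dropped in this sketch.
        return [self.vocab[t] for t in tokens if t in self.vocab]

    def encode(self, text: str) -> List[int]:
        return self.postprocess(self.tokenize(self.preprocess(text)))


vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
tokenizer = Tokenizer(vocab, preprocessors=[str.lower, str.strip])
print(tokenizer.encode(" Hi "))
```

Since the same `encode` path can be used in training and inference, an SSML layer would only need to sit in front of `preprocess` at inference time.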
I'll implement a proof of concept with the Tokenizer idea 👍 |
oh no |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels. |
not stale |
TokenizerAPI is WIP #1079 |
@synesthesiam any updates? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels. |
No activity does not mean that it is not important anymore. |
The |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels. |
Hey I started hacking some basic SSML support using Gruut as parsing. |
@WeberJulian is there an update on adding SSML support to coqui? Thanks for the info! |
I think this feature is even more relevant now. Consider that, for instance, many future research projects will need alternatives to Google TTS (which is quite strong in SSML) because of the European Union's ambition to strengthen solutions that contribute to trustworthy AI. Is anyone able to outline what the specific bottleneck for this feature is? What makes it difficult to implement? |
i really liked where this is going. what are the chances it could get merged with main so we can continue with it? |
Having the possibility to add user defined pauses in the speech would be great. |
Any update? |
Nobody's working on it from the dev team. |
For what it is worth, i consider this absolutely essential. I hope very much that this is re-opened and worked towards. you kind of quickly hit a brick wall of what you can do without SSML present |
Any updates? 👀 |
Still hanging around to see if there is any progress... |
Any update?? |
Do we have basic support of SSML now? Or is it only supported in the Gruut branch? |
No we don't have SSML and no timeline for it unless someone is contributing it. |
only sad reactions :'( |
Is your feature request related to a problem? Please describe.
For TTS, there is a need to choose a specific model or send additional data to the engine on how to handle a part of the text. Examples:
Describe the solution you'd like
Support SSML / coqui markup in input text. Example: