To use a model for inference tasks, Harmony Speech Engine needs to load it on startup. Once a model has been loaded, it is eligible for processing direct requests, or requests routed to it as part of a framework's internal routing chain.
The following models and speech frameworks are currently supported; a YAML config example is provided for each.
- Native / Third-Party models: built on top of other repositories. This is the preferred way of adding new models or frameworks to Harmony Speech Engine, since it minimizes implementation effort.
- Adapted models: have their neural network module code and parts of their repository's code copied into Harmony Speech Engine's codebase. This allows for optimizations and lets us support these models independently of the original group or individual who released them. Not all models or frameworks are well maintained or released with processing optimizations or toolchain interoperability in mind, so if the license allows it, we take the existing codebase and apply the adaptions necessary for best inference performance.
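All config examples below share the same basic structure: a `model_configs` list with one entry per loaded model. As an illustrative skeleton (the field names are taken from the examples in this document; the placeholder values are assumptions):

```yaml
model_configs:
  - name: "my-model"             # engine-local name used to address this model
    model: "org/model-name"      # Hugging Face model name or local path
    model_type: "FasterWhisper"  # implementation used to load and run the weights
    max_batch_size: 16           # upper bound for request batching
    dtype: "float32"             # numeric precision used for inference
    device_config:
      device: "cpu"              # target device, e.g. "cpu" or "cuda:0"
```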
Whisper is a SOTA audio transcription model developed by OpenAI. (Original Repo)
Faster-Whisper is a performance-improved version of Whisper, reimplemented using CTranslate2.
Harmony Speech Engine is compatible with all Faster-Whisper and Faster Distil-Whisper models supported by SYSTRAN's implementation.
Supports a local model path or Hugging Face resolution by model name, as described in the GitHub repo below. Just put the name of the desired model into the `model` field of the configuration.
Implementation Type: Native / Third-Party
Links:
```yaml
model_configs:
  - name: "faster-whisper"
    model: "large-v3-turbo"
    model_type: "FasterWhisper"
    max_batch_size: 16
    dtype: "float32"
    device_config:
      device: "cuda:0"
```
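To load the same model from disk instead of resolving it via Hugging Face, point the `model` field at a local path. A minimal sketch; the directory path is a hypothetical placeholder:

```yaml
model_configs:
  - name: "faster-whisper"
    model: "/opt/models/faster-whisper-large-v3-turbo"  # hypothetical local path
    model_type: "FasterWhisper"
    max_batch_size: 16
    dtype: "float32"
    device_config:
      device: "cuda:0"
```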
Harmony Speech is a high-performance one-shot voice cloning TTS engine developed by Project Harmony.AI, optimized for CPU-only, faster-than-realtime inference. It currently only supports an English-speaking synthesizer.
It consists of three models:
- Speaker Encoder model (based on CorentinJ's GE2E implementation)
- Speech Synthesis model (based on Forward-Tacotron)
- Vocoder model (based on kan-bayashi's MultiBand-MelGAN implementation)
Harmony Speech V1 model weights are released under the Apache License.
Supports a local model path or Hugging Face resolution by model name.
Implementation Type: Adapted
Links:
```yaml
model_configs:
  - name: "hs1-encoder"
    model: "harmony-ai/harmony-speech-v1"
    model_type: "HarmonySpeechEncoder"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "hs1-synthesizer"
    model: "harmony-ai/harmony-speech-v1"
    model_type: "HarmonySpeechSynthesizer"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "hs1-vocoder"
    model: "harmony-ai/harmony-speech-v1"
    model_type: "HarmonySpeechVocoder"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
```
(The Speaker Encoder is only needed if you want to create new embeddings or do a full voice clone; see the trimmed example below.)
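If your workflow reuses existing speaker embeddings and never creates new ones, the encoder entry can be left out. A trimmed variant of the config above (dropping the encoder is the only change; this assumes nothing in your request flow triggers embedding creation):

```yaml
model_configs:
  - name: "hs1-synthesizer"
    model: "harmony-ai/harmony-speech-v1"
    model_type: "HarmonySpeechSynthesizer"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "hs1-vocoder"
    model: "harmony-ai/harmony-speech-v1"
    model_type: "HarmonySpeechVocoder"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
```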
OpenVoice is a TTS framework created by MyShell AI. V1 focuses on voice cloning for the English and Chinese languages; it also supports emotional voice styles for English.
It consists of two models:
- Speaker Encoder & Voice Conversion model
- Single-Speaker TTS model
The voice conversion model is trained to convert voices between two speakers using vocal alignment. Technically, this is possible between arbitrary speakers, but the best results are achieved with the outputs of the Single-Speaker TTS model.
Additionally, the embedding step requires a VAD / ASR model like Whisper to function.
Supports a local model path or Hugging Face resolution by model name.
Implementation Type: Adapted
| Supported Language Tags | Supported Voice IDs / Styles |
|---|---|
| EN | default, whispering, shouting, excited, cheerful, terrified, angry, sad, friendly |
| ZH | default |
Links:
```yaml
model_configs:
  - name: "ov1-synthesizer-en"
    model: "myshell-ai/openvoice"
    model_type: "OpenVoiceV1Synthesizer"
    language: "EN"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "ov1-synthesizer-zh"
    model: "myshell-ai/openvoice"
    model_type: "OpenVoiceV1Synthesizer"
    language: "ZH"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "ov1-tone-converter"
    model: "myshell-ai/openvoice"
    model_type: "OpenVoiceV1ToneConverter"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "ov1-tone-converter-encoder"
    model: "myshell-ai/openvoice"
    model_type: "OpenVoiceV1ToneConverterEncoder"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "faster-whisper"
    model: "large-v3"
    model_type: "FasterWhisper"
    max_batch_size: 16
    dtype: "float32"
    device_config:
      device: "cuda:0"
```
(For best performance, Harmony Speech Engine runs OpenVoice models "task-based": we define different model types for encoding and voice conversion, even though they share the same weights. Since the TTS model weights differ per language, every language we want to support needs to be loaded in parallel.)
OpenVoice is a TTS framework created by MyShell AI. V2 focuses on voice cloning for multiple languages, building on top of a new TTS system called MeloTTS; it also supports multiple dialect styles for English.
It consists of two models:
- Speaker Encoder & Voice Conversion model
- Single-Speaker TTS model
The voice conversion model is trained to convert voices between two speakers using vocal alignment. Technically, this is possible between arbitrary speakers, but the best results are achieved with the outputs of the Single-Speaker TTS model.
Additionally, the embedding step requires a VAD / ASR model like Whisper to function.
Supports a local model path or Hugging Face resolution by model name.
Implementation Type: Adapted
| Model Name | Supported Language Tags | Supported Voice IDs / Styles |
|---|---|---|
| MeloTTS-English | EN (English) | EN-Default, EN-US, EN-BR, EN-INDIA, EN-AU |
| MeloTTS-English-v2 | EN (English) | EN-US, EN-BR, EN-INDIA, EN-AU |
| MeloTTS-English-v3 | EN (English) | EN-Newest |
| MeloTTS-Chinese | ZH (Chinese) | ZH |
| MeloTTS-Spanish | ES (Spanish) | ES |
| MeloTTS-French | FR (French) | FR |
| MeloTTS-Japanese | JP (Japanese) | JP |
| MeloTTS-Korean | KR (Korean) | KR |
Links:
ATTENTION: Only one model per language is currently supported; you can NOT route to MeloTTS-English-v3 and MeloTTS-English-v2 at the same time (see the swap example after the config below).
```yaml
model_configs:
  - name: "ov2-synthesizer-en"
    model: "myshell-ai/MeloTTS-English-v3"
    model_type: "MeloTTSSynthesizer"
    language: "EN"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "ov2-synthesizer-zh"
    model: "myshell-ai/MeloTTS-Chinese"
    model_type: "MeloTTSSynthesizer"
    language: "ZH"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "ov2-tone-converter"
    model: "myshell-ai/openvoicev2"
    model_type: "OpenVoiceV2ToneConverter"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "ov2-tone-converter-encoder"
    model: "myshell-ai/openvoicev2"
    model_type: "OpenVoiceV2ToneConverterEncoder"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
  - name: "faster-whisper"
    model: "large-v3"
    model_type: "FasterWhisper"
    max_batch_size: 16
    dtype: "float32"
    device_config:
      device: "cuda:0"
```
(For best performance, Harmony Speech Engine runs OpenVoice models "task-based": we define different model types for encoding and voice conversion, even though they share the same weights. Since the TTS model weights differ per language, every language we want to support needs to be loaded in parallel.)
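For example, to serve the English dialect styles from the table above instead of the EN-Newest voice, you would swap the English entry's `model` value; only that line changes, and the available styles then follow the table. A sketch, not a tested config:

```yaml
model_configs:
  - name: "ov2-synthesizer-en"
    model: "myshell-ai/MeloTTS-English-v2"  # replaces MeloTTS-English-v3; only one English model at a time
    model_type: "MeloTTSSynthesizer"
    language: "EN"
    max_batch_size: 10
    dtype: "float32"
    device_config:
      device: "cpu"
```

Further languages, e.g. a `MeloTTS-Spanish` entry with `language: "ES"`, can be appended to the list and loaded in parallel.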