This repo guides you through adding your own character voices, or even your own voice, to an existing VITS TTS model, so that in less than 1 hour it can perform the following tasks:
- Any-to-any voice conversion between you & any characters you added & preset characters
- English, Japanese & Chinese Text-to-Speech synthesis with the characters you added & preset characters
Feel free to play around with the base model, a Trilingual Anime VITS!
- Convert user's voice to characters listed here
- Chinese, English, Japanese TTS with user's voice
- Chinese, English, Japanese TTS with custom characters!
- Umamusume Pretty Derby (Used as base model pretraining)
- Sanoba Witch (Used as base model pretraining)
- Genshin Impact (Used as base model pretraining)
- Any character you wish as long as you have their voices!
It's recommended to perform fine-tuning on Google Colab because the original VITS has some dependencies that are difficult to configure.
- Install dependencies (2 min)
- Record at least 20 sentences in your own voice. The sentences to read are shown in the UI, and each is less than 20 words long. (5~10 min)
- Upload your character voices as a .zip file. Its file structure should look like this:
Your-zip-file.zip
├───Character_name_1
│   ├───xxx.wav
│   ├───...
│   ├───yyy.mp3
│   └───zzz.wav
├───Character_name_2
│   ├───xxx.wav
│   ├───...
│   ├───yyy.mp3
│   └───zzz.wav
├───...
│
└───Character_name_n
    ├───xxx.wav
    ├───...
    ├───yyy.mp3
    └───zzz.wav
Note that the format & names of the audio files do not matter as long as they are audio files.
Audio quality requirements: each clip should be at least 2s and at most 20s long, with as little background noise as possible.
Audio quantity requirements: at least 10 clips per character, preferably 20 or more.
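The requirements above can be checked locally before uploading. The following is a minimal sketch and not part of this repo: check_voice_zip is a hypothetical helper name, and it assumes every clip sits directly inside its character folder, as in the layout shown earlier. It uses only the Python standard library, so it can measure the duration of PCM .wav clips only; other formats such as .mp3 are counted toward the per-character total but their length is not checked.

```python
import io
import wave
import zipfile
from collections import defaultdict

# Thresholds taken from the requirements stated above.
MIN_SECONDS = 2.0
MAX_SECONDS = 20.0
MIN_CLIPS = 10


def check_voice_zip(zip_source):
    """Return a list of human-readable problems found in the archive.

    zip_source may be a path or a binary file-like object.
    Hypothetical helper for illustration; not provided by this repo.
    """
    problems = []
    clips_per_character = defaultdict(int)

    with zipfile.ZipFile(zip_source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            parts = info.filename.split("/")
            if len(parts) != 2:
                problems.append(f"{info.filename}: expected Character_name/file")
                continue
            character, name = parts
            clips_per_character[character] += 1

            # Only .wav clips can be measured with the standard library.
            if not name.lower().endswith(".wav"):
                continue
            try:
                with wave.open(io.BytesIO(archive.read(info)), "rb") as wav:
                    seconds = wav.getnframes() / wav.getframerate()
            except (wave.Error, EOFError, ZeroDivisionError):
                problems.append(f"{info.filename}: could not be read as PCM WAV")
                continue
            if not MIN_SECONDS <= seconds <= MAX_SECONDS:
                problems.append(f"{info.filename}: {seconds:.1f}s is outside 2-20s")

    for character, count in sorted(clips_per_character.items()):
        if count < MIN_CLIPS:
            problems.append(
                f"{character}: only {count} clips, need at least {MIN_CLIPS}"
            )
    return problems
```

Calling check_voice_zip("Your-zip-file.zip") returns an empty list when nothing was flagged. Background noise is not assessed, so the clips still need to be checked by ear.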
You can perform step 2 (recording your own voice), step 3 (uploading character voices), or both, depending on your needs.
- Fine-tune (30 min)
Once fine-tuning has finished, download the fine-tuned model & model config.
- Remember to download your fine-tuned model!
- Download the latest release
- Put your model & config file into the folder inference, and make sure to rename the model to G_latest.pth and the config file to finetune_speaker.json
- The file structure should be as follows:
inference
├───inference.exe
├───...
├───finetune_speaker.json
└───G_latest.pth
- Run inference.exe; the browser should open automatically.
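The copy-and-rename step above can also be scripted. This is a minimal sketch under stated assumptions, not a tool shipped with the release: install_finetuned_model is a hypothetical helper, and the source file names in the demo (G_1000.pth, config.json) are made up, so substitute whatever names your downloaded files actually have. Only the target names G_latest.pth and finetune_speaker.json come from the instructions above.

```python
import shutil
import tempfile
from pathlib import Path


def install_finetuned_model(model_file, config_file, inference_dir="inference"):
    """Copy the downloaded files into the inference folder under the
    names the instructions above ask for.

    Hypothetical helper for illustration; not provided by this repo.
    """
    target = Path(inference_dir)
    if not target.is_dir():
        raise FileNotFoundError(f"inference folder not found: {target}")

    model_dst = target / "G_latest.pth"
    config_dst = target / "finetune_speaker.json"
    # Copy rather than move, so the original downloads stay untouched.
    shutil.copyfile(model_file, model_dst)
    shutil.copyfile(config_file, config_dst)
    return model_dst, config_dst


if __name__ == "__main__":
    # Demo with throwaway files; "G_1000.pth" and "config.json" are
    # made-up names standing in for your real downloads.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "inference").mkdir()
        (root / "G_1000.pth").write_bytes(b"fake weights")
        (root / "config.json").write_text("{}")
        model, config = install_finetuned_model(
            root / "G_1000.pth", root / "config.json", root / "inference"
        )
        print(model.name, config.name)
```

Copying instead of moving keeps the original downloads as a backup in case the files in the inference folder are later overwritten by a newer fine-tune.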