[Bug][Feature Request] Can't quantize vits models and ONNX quantization working slower than non-quantized version. #2991
Replies: 7 comments
-
Hello @mllopartbsc
-
Hi @Nanayeb34, I've just updated the post; I hope it makes sense now. Also, I think the reason the dynamically quantized model runs slower than the non-quantized version is that dynamic quantization adds overhead when the model contains many CNN layers. As for PyTorch's quantization, I still don't know the source of the error. Kind regards.
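One detail that supports this explanation: PyTorch's dynamic quantization only replaces the layer types you list (by default `nn.Linear` and the RNN family), so the convolutions that dominate a VITS decoder stay in float32. A minimal sketch with a toy conv-heavy model (hypothetical stand-in, not the actual VITS code):

```python
import torch
import torch.nn as nn

# Toy conv-heavy network standing in for the VITS decoder (hypothetical example).
model = nn.Sequential(
    nn.Conv1d(16, 32, kernel_size=3),
    nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=3),
    nn.ReLU(),
    nn.Linear(32, 10),
)

# Dynamic quantization only swaps out the layer types listed here;
# the Conv1d layers are left untouched in float32.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(type(quantized[0]))  # still a plain Conv1d
print(type(quantized[4]))  # a dynamically quantized Linear
```

So on a model where most of the compute is convolutional, the quantized Linear layers save little, while the per-call quantize/dequantize of activations adds overhead.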
-
@mllopartbsc I think I may have made some progress. Apparently, the quantization is failing because we can't make a deepcopy of the model parameters. I created a copy of my custom fine-tuned model and made
This succeeded and I was able to quantize the model. However, running inference with it produces the following error.
It seems the quantization is not being done properly, because the model state_dict keys are empty. You can find my code here and model here. The config file is here
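The "can't deepcopy the model parameters" failure can be reproduced in isolation: PyTorch refuses to deepcopy any tensor that was *computed* from a parameter (a non-leaf tensor with a `grad_fn`), and `quantize_dynamic` deepcopies the model internally. A minimal sketch with a toy module (not the actual VITS code) showing the error and the copy-plus-detach workaround described above:

```python
import copy
import torch
import torch.nn as nn

# Toy module: a tensor attribute computed from a parameter is a non-leaf
# tensor, and PyTorch cannot deepcopy non-leaf tensors.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.base = nn.Parameter(torch.randn(4, 4))
        self.weight = self.base * 2.0  # non-leaf: carries a grad_fn

m = Toy()
try:
    copy.deepcopy(m)
except RuntimeError as e:
    print("deepcopy failed:", e)

# Workaround matching the copy-and-detach approach: detach the offending
# tensor so it becomes a graph leaf, then the deepcopy succeeds.
m.weight = m.weight.detach()
copy.deepcopy(m)
```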
-
Hi @Nanayeb34, thank you for your efforts. Regarding this inference error, what would be the next steps toward a solution, and how could I help? I'll try reproducing your quantization to see what I can find. Do you know why the state_dict keys are not being stored properly? Also, the link you sent is not working. Kind regards.
-
Sorry about that, @mllopartbsc. I messed up the links; I have updated them now, so they should work. As for the state_dict not being stored, I am at a loss as to why. I tried something different by quantizing the default vits model that Coqui provides using your approach above, and the state_dict keys were missing as well.
-
Hi @Nanayeb34, from the code you provided, where exactly do you export the non-ONNX dynamically quantized model? And where do you import it back and load it as a checkpoint for inference?
-
Hi @mllopartbsc, this is not related to your thread, but for the first approach you posted: how can I save the audio that vits.inference_onnx returns as a WAV file?
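A sketch of one way to do this, assuming `vits.inference_onnx` returns a float NumPy array in [-1, 1] (a placeholder array is used below; the sample rate should come from the model's config.json):

```python
import wave

import numpy as np

# Assumption: vits.inference_onnx returns a float numpy array in [-1, 1].
# wav = vits.inference_onnx(text_inputs)
wav = np.random.uniform(-1, 1, 22050).astype(np.float32)  # placeholder audio

sample_rate = 22050  # take this from the model's config.json
pcm = (np.clip(np.squeeze(wav), -1.0, 1.0) * 32767).astype(np.int16)

with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)        # mono
    f.setsampwidth(2)        # 16-bit PCM
    f.setframerate(sample_rate)
    f.writeframes(pcm.tobytes())
```

This uses only the standard-library `wave` module; `scipy.io.wavfile.write` or Coqui's own audio helpers would work just as well.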
-
Describe the bug
Hi everyone. I was trying to quantize a vits model, and I tried two approaches. On the one hand, I converted the model to ONNX and then applied dynamic quantization to it. On the other hand, I tried using PyTorch's dynamic quantization function on the model's checkpoint. The first method delivered inference runtimes more than twice as slow as the non-quantized model, while the second method failed with this error:
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment
I was wondering if anyone has been able to successfully quantize a vits model and achieved better performance results for inference.
Kind Regards.
To Reproduce
This first code tries to quantize a VITS model that had previously been converted to ONNX.
In order to run this script you'll need to download the config.json and model_file.pth files from this link, save them in your preferred location, and change the paths.
You'll see that the quantized version runs much slower than the non-quantized version.
Here's the non-quantized script of the same model for comparison.
This third script tries to quantize a VITS model using PyTorch's dynamic quantization function.
When running this script, I get this error:
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment
I already tried changing vits.py in order to detach all tensors in the state dict.
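A likely source of the deepcopy error in VITS specifically is weight normalization: VITS layers are wrapped in `weight_norm`, which replaces `weight` with a tensor computed from `weight_g`/`weight_v`, i.e. a non-leaf tensor that PyTorch cannot deepcopy (and `quantize_dynamic` deepcopies the model internally). A sketch on a single toy layer, not the actual VITS code:

```python
import copy
import torch
import torch.nn as nn
from torch.nn.utils import remove_weight_norm, weight_norm

# A weight-normed layer carries a computed (non-leaf) `weight` tensor.
layer = weight_norm(nn.Conv1d(4, 4, kernel_size=3))

try:
    copy.deepcopy(layer)
except RuntimeError as e:
    print("deepcopy fails:", e)

# Folding weight norm back into a plain leaf parameter fixes the deepcopy.
remove_weight_norm(layer)
copy.deepcopy(layer)  # now succeeds
```

If this is indeed the cause, calling the model's `remove_weight_norm`-style cleanup before quantizing (rather than detaching tensors one by one in vits.py) may be the simpler fix.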
Expected behavior
No response
Logs
No response
Environment
Additional context
These are the specifications of the system I'm running it on.