
Issues Encountered When Generating Chinese Audio #139

Open
JV-X opened this issue Feb 20, 2025 · 5 comments

JV-X commented Feb 20, 2025

Hello, I'm trying to run sample.py on my computer to generate a Chinese audio clip. I placed an audio file of my own voice named mine.wav in the assets directory and modified sample.py as follows:

import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict
from zonos.utils import DEFAULT_DEVICE as device
import time

# model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device=device)
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)
print(f'model loaded...     device is: {device}')

# Build a speaker embedding from my own reference recording.
# wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
wav, sampling_rate = torchaudio.load("assets/mine.wav")
speaker = model.make_speaker_embedding(wav, sampling_rate)

torch.manual_seed(421)

start = time.time()

# Ten Chinese characters, Mandarin ("cmn") as the target language.
cond_dict = make_cond_dict(text="你好你好这里是十个字", speaker=speaker, language="cmn")
conditioning = model.prepare_conditioning(cond_dict)

step = time.time()
print(f'cost: {step - start}')
codes = model.generate(conditioning)
print(f'cost: {time.time() - step}')

# Decode the generated codes back to a waveform and save it.
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)

However, I ran into a strange issue: the generation progress bar stops at 11%, and the program exits without ever reaching 100%. A sample.wav file is still written to the directory, but when I play it, it contains meaningless sounds instead of the text I asked it to generate.

You can download the sample.wav file from either of the following links:
https://c.wss.cc/f/gcftpu5gp3f
https://www.wenshushu.cn/f/gcftpu5gp3f


log:

(base) hygx@hygx:~/code/Zonos$  cd /home/hygx/code/Zonos ; /usr/bin/env /home/hygx/anaconda3/envs/zonos/bin/python /home/hygx/.vscode-server/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 54979 -- /home/hygx/code/Zonos/sample.py 
model loaded...     device is: cuda:0
cost: 0.26287078857421875
Generating:  10%|████▏                                   | 271/2588 [00:36<00:19, 121.35it/s]
cost: 40.23396587371826
Generating:  11%|████▍                                    | 278/2588 [00:36<05:05,  7.57it/s]
(base) hygx@hygx:~/code/Zonos$

Additionally, I observed that generating ten characters took more than 40 seconds.
Could you please help me understand why this happened and if there is a solution?
Thank you for your response.


rzgarespo commented Feb 20, 2025

See line 136 of https://github.com/Zyphra/Zonos/blob/main/gradio_interface.py.
No issue here; it just rendered 271 frames out of the maximum allowed 2588 frames.
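
As an aside, a minimal sketch of how one might lower that cap explicitly, continuing from the sample.py above. It assumes model.generate accepts a max_new_tokens argument (which is what the setting at that line of gradio_interface.py appears to control); the keyword name is an assumption, not verified here:

# Hypothetical: request at most 1000 frames instead of the default cap of 2588.
# max_new_tokens is an assumed keyword of model.generate; check the repository.
codes = model.generate(conditioning, max_new_tokens=1000)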


JV-X commented Feb 20, 2025

> See line 136 of https://github.com/Zyphra/Zonos/blob/main/gradio_interface.py. No issue here; it just rendered 271 frames out of the maximum allowed 2588 frames.

Sorry about that, but I didn't quite understand what you meant. Does what you said have anything to do with this issue?

@rzgarespo

> > See line 136 of https://github.com/Zyphra/Zonos/blob/main/gradio_interface.py. No issue here; it just rendered 271 frames out of the maximum allowed 2588 frames.
>
> Sorry about that, but I didn't quite understand what you meant. Does what you said have anything to do with this issue?

271 frames means your audio is roughly 10% of 30 seconds; 2588 frames corresponds to 30 seconds.
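
For reference, a quick conversion of those numbers, a minimal sketch using only the figures quoted in this thread (the frame rate is derived from 2588 frames ≈ 30 seconds, not read from the code):

MAX_FRAMES = 2588                 # progress-bar total shown in the log above
MAX_SECONDS = 30.0                # per the comment above, 2588 frames ≈ 30 s
frames_per_second = MAX_FRAMES / MAX_SECONDS       # ≈ 86 frames per second

generated_frames = 271            # where the progress bar stopped
duration_s = generated_frames / frames_per_second
print(f"{duration_s:.1f} s of audio generated")    # ≈ 3.1 s, plausible for ten characters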


JV-X commented Feb 21, 2025

> > > See line 136 of https://github.com/Zyphra/Zonos/blob/main/gradio_interface.py. No issue here; it just rendered 271 frames out of the maximum allowed 2588 frames.
> >
> > Sorry about that, but I didn't quite understand what you meant. Does what you said have anything to do with this issue?
>
> 271 frames means your audio is roughly 10% of 30 seconds; 2588 frames corresponds to 30 seconds.

I tried trimming mine.wav to 29 seconds, but the log didn't change much.

(base) hygx@hygx:~/code/Zonos$  cd /home/hygx/code/Zonos ; /usr/bin/env /home/hygx/anaconda3/envs/zonos/bin/python /home/hygx/.vscode-server/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 60349 -- /home/hygx/code/Zonos/sample.py 
model loaded...     device is: cuda:0
cost: 0.29413795471191406
Generating:  11%|████▎                                   | 280/2588 [00:35<00:17, 131.72it/s]
cost: 39.00833225250244
Generating:  11%|████▍                                    | 281/2588 [00:35<04:47,  8.01it/s]
(base) hygx@hygx:~/code/Zonos$ 

And the process still takes nearly 40 seconds. The generated audio also sounds weird and still doesn't sound like Chinese.
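
For what it's worth, a minimal sketch of trimming the reference clip in code before building the speaker embedding (the mine.wav path and the 29-second cut come from this thread; the rest is plain torchaudio slicing):

import torchaudio

wav, sampling_rate = torchaudio.load("assets/mine.wav")
max_seconds = 29                                     # length used in the test above
wav = wav[:, : int(max_seconds * sampling_rate)]     # keep only the first 29 seconds
torchaudio.save("assets/mine_trimmed.wav", wav, sampling_rate)  # example output name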

@xuekedou

I used the same code to generate Chinese audio and also ran into the problem that the speech does not sound like Chinese.
When I try to use the hybrid model, an error is raised saying that only the transformer model is supported:

File "/Zonos/zonos/backbone/_torch.py", line 57, in init
assert not config.ssm_cfg, "This backbone implementation only supports the Transformer model."
^^^^^^^^^^^^^^^^^^
AssertionError: This backbone implementation only supports the Transformer model.

Could you please give me a tutorial on generating Chinese audio?
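
Until the hybrid checkpoint loads in this setup, a minimal workaround sketch is to fall back to the transformer checkpoint when that assertion fires; only Zonos.from_pretrained and the two model names from this thread are used, and the try/except fallback is just one possible way to handle it:

from zonos.model import Zonos
from zonos.utils import DEFAULT_DEVICE as device

try:
    # The hybrid checkpoint carries an SSM config (config.ssm_cfg) that the
    # pure-torch backbone rejects with the AssertionError shown above.
    model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device=device)
except AssertionError:
    print("Hybrid backbone unavailable; falling back to the transformer model")
    model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)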
