
Issues Encountered When Generating Chinese Audio #139

Open
JV-X opened this issue Feb 20, 2025 · 5 comments

JV-X commented Feb 20, 2025

Hello, I'm trying to run sample.py on my computer to generate a Chinese audio clip. I placed an audio file of my own voice named mine.wav in the assets directory and modified sample.py as follows:

import torch
import torchaudio
from zonos.model import Zonos
from zonos.conditioning import make_cond_dict
from zonos.utils import DEFAULT_DEVICE as device
import time

# model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device=device)
model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)
print(f'model loaded...     device is: {device}')

# Build a speaker embedding from my own reference recording.
# wav, sampling_rate = torchaudio.load("assets/exampleaudio.mp3")
wav, sampling_rate = torchaudio.load("assets/mine.wav")
speaker = model.make_speaker_embedding(wav, sampling_rate)

torch.manual_seed(421)

start = time.time()

# Ten Chinese characters, Mandarin ("cmn") as the target language.
cond_dict = make_cond_dict(text="你好你好这里是十个字", speaker=speaker, language="cmn")
conditioning = model.prepare_conditioning(cond_dict)

step = time.time()
print(f'cost: {step - start}')
codes = model.generate(conditioning)
print(f'cost: {time.time() - step}')

# Decode the generated codes back to a waveform and save it.
wavs = model.autoencoder.decode(codes).cpu()
torchaudio.save("sample.wav", wavs[0], model.autoencoder.sampling_rate)

However, I ran into a strange issue: the generation progress bar stops at 11%, and the program exits without ever reaching 100%. A sample.wav file is still written to the directory, but when I play it, it contains meaningless sounds instead of the text I asked it to generate.

You can download the sample.wav file from either of the following links:
https://c.wss.cc/f/gcftpu5gp3f
https://www.wenshushu.cn/f/gcftpu5gp3f


log:

(base) hygx@hygx:~/code/Zonos$  cd /home/hygx/code/Zonos ; /usr/bin/env /home/hygx/anaconda3/envs/zonos/bin/python /home/hygx/.vscode-server/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 54979 -- /home/hygx/code/Zonos/sample.py 
model loaded...     device is: cuda:0
cost: 0.26287078857421875
Generating:  10%|████▏                                   | 271/2588 [00:36<00:19, 121.35it/s]
cost: 40.23396587371826
Generating:  11%|████▍                                    | 278/2588 [00:36<05:05,  7.57it/s]
(base) hygx@hygx:~/code/Zonos$

Additionally, I observed that generating ten characters took more than 40 seconds.
Could you please help me understand why this happened and if there is a solution?
Thank you for your response.


rzgarespo commented Feb 20, 2025

See line 136 of https://github.com/Zyphra/Zonos/blob/main/gradio_interface.py.
No issue here; it just rendered 271 frames out of the maximum allowed 2588 frames.
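
As an aside, a minimal sketch of how one might lower that cap explicitly, continuing from the sample.py above. It assumes model.generate accepts a max_new_tokens argument (which is what the setting at that line of gradio_interface.py appears to control); the keyword name is an assumption, not verified here:

# Hypothetical: request at most 1000 frames instead of the default cap of 2588.
# max_new_tokens is an assumed keyword of model.generate; check the repository.
codes = model.generate(conditioning, max_new_tokens=1000)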


JV-X commented Feb 20, 2025

> See line 136 of https://github.com/Zyphra/Zonos/blob/main/gradio_interface.py. No issue here; it just rendered 271 frames out of the maximum allowed 2588 frames.

Sorry about that, but I didn't quite understand what you meant. Does what you said have anything to do with this issue?

@rzgarespo

> > See line 136 of https://github.com/Zyphra/Zonos/blob/main/gradio_interface.py. No issue here; it just rendered 271 frames out of the maximum allowed 2588 frames.
>
> Sorry about that, but I didn't quite understand what you meant. Does what you said have anything to do with this issue?

271 frames means your audio is roughly 10% of 30 seconds; 2588 frames corresponds to 30 seconds.
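
For reference, a quick conversion of those numbers, a minimal sketch using only the figures quoted in this thread (the frame rate is derived from 2588 frames ≈ 30 seconds, not read from the code):

MAX_FRAMES = 2588                 # progress-bar total shown in the log above
MAX_SECONDS = 30.0                # per the comment above, 2588 frames ≈ 30 s
frames_per_second = MAX_FRAMES / MAX_SECONDS       # ≈ 86 frames per second

generated_frames = 271            # where the progress bar stopped
duration_s = generated_frames / frames_per_second
print(f"{duration_s:.1f} s of audio generated")    # ≈ 3.1 s, plausible for ten characters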


JV-X commented Feb 21, 2025

> > > See line 136 of https://github.com/Zyphra/Zonos/blob/main/gradio_interface.py. No issue here; it just rendered 271 frames out of the maximum allowed 2588 frames.
> >
> > Sorry about that, but I didn't quite understand what you meant. Does what you said have anything to do with this issue?
>
> 271 frames means your audio is roughly 10% of 30 seconds; 2588 frames corresponds to 30 seconds.

I tried trimming mine.wav to 29 seconds, but the log didn't change much.

(base) hygx@hygx:~/code/Zonos$  cd /home/hygx/code/Zonos ; /usr/bin/env /home/hygx/anaconda3/envs/zonos/bin/python /home/hygx/.vscode-server/extensions/ms-python.debugpy-2025.0.1-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 60349 -- /home/hygx/code/Zonos/sample.py 
model loaded...     device is: cuda:0
cost: 0.29413795471191406
Generating:  11%|████▎                                   | 280/2588 [00:35<00:17, 131.72it/s]
cost: 39.00833225250244
Generating:  11%|████▍                                    | 281/2588 [00:35<04:47,  8.01it/s]
(base) hygx@hygx:~/code/Zonos$ 

And the process still takes nearly 40 seconds. The generated audio also sounds weird and still doesn't sound like Chinese.
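
For what it's worth, a minimal sketch of trimming the reference clip in code before building the speaker embedding (the mine.wav path and the 29-second cut come from this thread; the rest is plain torchaudio slicing):

import torchaudio

wav, sampling_rate = torchaudio.load("assets/mine.wav")
max_seconds = 29                                     # length used in the test above
wav = wav[:, : int(max_seconds * sampling_rate)]     # keep only the first 29 seconds
torchaudio.save("assets/mine_trimmed.wav", wav, sampling_rate)  # example output name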

@xuekedou

I used the same code to generate Chinese audio and also ran into the problem that the speech does not sound like Chinese.
When I try to use the hybrid model, an error is raised saying that only the transformer model is supported:

File "/Zonos/zonos/backbone/_torch.py", line 57, in init
assert not config.ssm_cfg, "This backbone implementation only supports the Transformer model."
^^^^^^^^^^^^^^^^^^
AssertionError: This backbone implementation only supports the Transformer model.

Could you please give me a tutorial on generating Chinese audio?
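
Until the hybrid checkpoint loads in this setup, a minimal workaround sketch is to fall back to the transformer checkpoint when that assertion fires; only Zonos.from_pretrained and the two model names from this thread are used, and the try/except fallback is just one possible way to handle it:

from zonos.model import Zonos
from zonos.utils import DEFAULT_DEVICE as device

try:
    # The hybrid checkpoint carries an SSM config (config.ssm_cfg) that the
    # pure-torch backbone rejects with the AssertionError shown above.
    model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-hybrid", device=device)
except AssertionError:
    print("Hybrid backbone unavailable; falling back to the transformer model")
    model = Zonos.from_pretrained("Zyphra/Zonos-v0.1-transformer", device=device)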
