Real time streaming support instead of text chunk stream #700

Open · 5 tasks done
saifulislam79 opened this issue Jan 8, 2025 · 1 comment
Labels: enhancement (New feature or request)
Checks

  • This template is only for feature requests.
  • I have thoroughly reviewed the project documentation but couldn't find any relevant information that meets my needs.
  • I have searched for existing issues, including closed ones, and found no existing discussion.
  • I confirm that I am using English to submit this report in order to facilitate communication.

1. Is this request related to a challenge you're experiencing? Tell us your story.

I have been working with the F5-TTS project for a few days, and I am grateful to the authors because they are active and respond quickly. My question: F5-TTS currently offers chunk-based streaming, but is real-time streaming possible, like the XTTS v2 streaming inference that yields audio bytes? Is there any possibility of adding true streaming inference instead of chunk streaming?

2. What is your suggested solution?

I have found some inference code that looks like streaming, but it merges the chunks and uses cross-fading:
```python
# inference
with torch.inference_mode():
    generated, _ = model_obj.sample(
        cond=audio,
        text=final_text_list,
        duration=duration,
        steps=nfe_step,
        cfg_strength=cfg_strength,
        sway_sampling_coef=sway_sampling_coef,
    )

    generated = generated.to(torch.float32)
    # drop the reference-audio frames, keep only the newly generated mel
    generated = generated[:, ref_audio_len:, :]
    generated_mel_spec = generated.permute(0, 2, 1)
    # decode the mel spectrogram to a waveform with the selected vocoder
    if mel_spec_type == "vocos":
        generated_wave = vocoder.decode(generated_mel_spec)
    elif mel_spec_type == "bigvgan":
        generated_wave = vocoder(generated_mel_spec)
    # rescale back toward the reference audio's loudness
    if rms < target_rms:
        generated_wave = generated_wave * rms / target_rms
```

See https://github.com/SWivid/F5-TTS/blob/main/src/f5_tts/infer/utils_infer.py around line number 455. Is it possible to yield the generated wave as a stream?
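In principle, each iteration of that loop already produces one fully decoded waveform, so instead of collecting the chunks and cross-fading them at the end, the function could yield each chunk as soon as the vocoder has decoded it. Below is a minimal, untested sketch of that idea; the function name `stream_infer`, the per-chunk `text_chunk_lists`/`durations` arguments, and the int16 PCM conversion are my own assumptions for illustration, while the inner calls mirror the snippet above.

```python
import numpy as np
import torch


def stream_infer(model_obj, vocoder, audio, ref_audio_len,
                 text_chunk_lists, durations, nfe_step, cfg_strength,
                 sway_sampling_coef, mel_spec_type="vocos",
                 rms=0.1, target_rms=0.1):
    """Yield each generated chunk as raw 16-bit PCM bytes as soon as it is decoded.

    `text_chunk_lists` and `durations` are assumed to be prepared per chunk in
    the same way `final_text_list` and `duration` are prepared in utils_infer.py.
    """
    for final_text_list, duration in zip(text_chunk_lists, durations):
        with torch.inference_mode():
            generated, _ = model_obj.sample(
                cond=audio,
                text=final_text_list,
                duration=duration,
                steps=nfe_step,
                cfg_strength=cfg_strength,
                sway_sampling_coef=sway_sampling_coef,
            )

            generated = generated.to(torch.float32)
            generated = generated[:, ref_audio_len:, :]
            generated_mel_spec = generated.permute(0, 2, 1)
            if mel_spec_type == "vocos":
                generated_wave = vocoder.decode(generated_mel_spec)
            elif mel_spec_type == "bigvgan":
                generated_wave = vocoder(generated_mel_spec)
            if rms < target_rms:
                generated_wave = generated_wave * rms / target_rms

        # Instead of appending to a list and cross-fading later, hand the
        # chunk to the caller immediately as int16 PCM bytes.
        wave_np = generated_wave.squeeze().cpu().numpy()
        wave_np = np.clip(wave_np, -1.0, 1.0)
        yield (wave_np * 32767).astype(np.int16).tobytes()
```

A caller (for example a FastAPI or WebSocket endpoint) could iterate over this generator and push each byte chunk to the client while the next chunk is still being synthesized, similar to how the XTTS v2 streaming interface is typically used. Note this is still chunk-level streaming of the existing sampler, not frame-by-frame streaming from inside `sample()`.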

3. Additional context or comments

Details are already shared in the sections above.

4. Can you help us with this feature?

  • I am interested in contributing to this feature.
@saifulislam79 saifulislam79 added the enhancement New feature or request label Jan 8, 2025
SWivid (Owner) commented Jan 10, 2025

> I have found some inference code that looks like streaming, but it merges the chunks and uses cross-fading

it's just chunk inference

welcome pr~
