
[Bug] Streaming inference does not work #4118

Open
1640675651 opened this issue Dec 31, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@1640675651

Describe the bug

Tried the streaming code at https://docs.coqui.ai/en/latest/models/xtts.html#streaming-manually with use_deepspeed=False on CPU. It fails with: AttributeError: 'int' object has no attribute '_pad_token_tensor'
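
For context, a quick first check (a minimal sketch, assuming the failure comes from a version mismatch between TTS 0.22.0 and a newer transformers release; the _pad_token_tensor attribute only appears in recent transformers versions):

import transformers

# TTS 0.22.0's streaming generator targets an older transformers
# generation API (an assumption based on the traceback below), so the
# installed version is the first thing to check.
print(transformers.__version__)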

To Reproduce

import os
import time
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

print("Loading model...")
config = XttsConfig()
config.load_json("/path/to/xtts/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=False)
#model.cuda()

print("Computing speaker latents...")
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=["reference.wav"])

print("Inference...")
t0 = time.time()
chunks = model.inference_stream(
    "It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)

wav_chunks = []
for i, chunk in enumerate(chunks):
    if i == 0:
        print(f"Time to first chunk: {time.time() - t0}")
    print(f"Received chunk {i} of audio length {chunk.shape[-1]}")
    wav_chunks.append(chunk)
wav = torch.cat(wav_chunks, dim=0)
torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)

Expected behavior

It should stream the generated audio chunks and save the full waveform to xtts_streaming.wav.

Logs

/Users/zhz/miniconda3/envs/xtts/lib/python3.10/site-packages/TTS/tts/layers/xtts/stream_generator.py:138: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
/Users/zhz/miniconda3/envs/xtts/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:818: UserWarning: `return_dict_in_generate` is NOT set to `True`, but `output_hidden_states` is. When `return_dict_in_generate` is not `True`, `output_hidden_states` is ignored.
  warnings.warn(
Traceback (most recent call last):
  File "/Users/zhz/Desktop/paradigm/conversation_playground/xtts/xtts_streaming.py", line 31, in <module>
    for i, chunk in enumerate(chunks):
  File "/Users/zhz/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 36, in generator_context
    response = gen.send(None)
  File "/Users/zhz/miniconda3/envs/xtts/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 652, in inference_stream
    gpt_generator = self.gpt.get_generator(
  File "/Users/zhz/miniconda3/envs/xtts/lib/python3.10/site-packages/TTS/tts/layers/xtts/gpt.py", line 603, in get_generator
    return self.gpt_inference.generate_stream(
  File "/Users/zhz/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/Users/zhz/miniconda3/envs/xtts/lib/python3.10/site-packages/TTS/tts/layers/xtts/stream_generator.py", line 186, in generate
    model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
  File "/Users/zhz/miniconda3/envs/xtts/lib/python3.10/site-packages/transformers/generation/utils.py", line 585, in _prepare_attention_mask_for_generation
    pad_token_id = generation_config._pad_token_tensor
AttributeError: 'int' object has no attribute '_pad_token_tensor'
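
The traceback points to a signature mismatch: TTS's stream_generator passes a bare int pad_token_id where recent transformers expects a GenerationConfig object, whose attributes (such as _pad_token_tensor) are then read internally. A minimal sketch of the difference, as an assumption based on the traceback rather than a verified reading of transformers internals:

from transformers import GenerationConfig

# Recent transformers reads special tokens as attributes of a
# GenerationConfig object rather than as plain ints.
gen_cfg = GenerationConfig(pad_token_id=0)
print(gen_cfg.pad_token_id)  # attribute access on a config object works

# When an int is passed where the config object is expected, the internal
# attribute access (roughly generation_config._pad_token_tensor) fails
# with the AttributeError shown above.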

Environment

{
    "CUDA": {
        "GPU": [],
        "available": false,
        "version": null
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.5.1",
        "TTS": "0.22.0",
        "numpy": "1.22.0"
    },
    "System": {
        "OS": "Darwin",
        "architecture": [
            "64bit",
            ""
        ],
        "processor": "arm",
        "python": "3.10.16",
        "version": "Darwin Kernel Version 24.1.0: Thu Oct 10 21:06:23 PDT 2024; root:xnu-11215.41.3~3/RELEASE_ARM64_T8132"
    }
}

Additional context

No response

@1640675651 1640675651 added the bug Something isn't working label Dec 31, 2024
@sinangokce
Copy link

I'm facing the exact same issue.

@eginhard
Copy link
Contributor

You can use our fork (via pip install coqui-tts); it works with recent versions of transformers. This repo is no longer updated.
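
A quick way to confirm which package is in use after switching (a sketch; the fork keeps the TTS import namespace per its docs, and its releases are versioned above 0.22.0):

import TTS

# After `pip install coqui-tts`, this should print a version newer than
# 0.22.0 (an assumption about the fork's versioning).
print(TTS.__version__)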

@sinangokce
Copy link

sinangokce commented Dec 31, 2024

@eginhard Thanks! Can you share the installation steps? I'm on Ubuntu 22.04, and when I run the streaming script I get this error: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api'

@sinangokce
Copy link

I found out that deepspeed 0.10.3, the version documented at https://coqui-tts.readthedocs.io/en/latest/models/xtts.html, caused this issue. Installing version 0.14.4, as proposed in huggingface/alignment-handbook#180, solved it.
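
For reference, once a compatible deepspeed is installed, streaming can be enabled through the same flag used in the repro above. A sketch assuming deepspeed 0.14.4 and a CUDA GPU (deepspeed does not apply to the CPU-only setup in the original report):

# Same setup as the repro script, with deepspeed acceleration enabled.
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts/", use_deepspeed=True)
model.cuda()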
