Please only report an issue here if you're using the fork's code; the original package hasn't been updated in over a year now. Can you try again with the fork and provide enough details to reproduce (training recipe, config, environment)?
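For reference, a quick way to collect the requested environment details (this assumes the fork still exposes `TTS.__version__` the way upstream Coqui TTS does):

```python
# Minimal environment report; `TTS.__version__` is assumed to exist as in upstream Coqui TTS.
import platform
import sys

import torch
import TTS

print("Python :", sys.version.split()[0])
print("OS     :", platform.platform())
print("PyTorch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("TTS    :", TTS.__version__)
print("GPU    :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```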
Describe the bug
When training a multi-speaker VITS model, the following error occurs after epoch 2:
```
....
--> TIME: 2025-01-18 05:51:44 -- STEP: 1141/1144 -- GLOBAL_STEP: 228275
| > loss_disc: 2.55525541305542 (2.3107323108484175)
| > loss_disc_real_0: 0.07618683576583862 (0.1314041224111602)
| > loss_disc_real_1: 0.2294345200061798 (0.18530447144423717)
| > loss_disc_real_2: 0.14496399462223053 (0.21591699230472428)
| > loss_disc_real_3: 0.2127830535173416 (0.2248999754539924)
| > loss_disc_real_4: 0.17633846402168274 (0.2232837798131024)
| > loss_disc_real_5: 0.1818462312221527 (0.21921113943134243)
| > loss_0: 2.55525541305542 (2.3107323108484175)
| > grad_norm_0: tensor(19.7260, device='cuda:0') (tensor(25.7604, device='cuda:0'))
| > loss_gen: 2.512892723083496 (2.6231823042753417)
| > loss_kl: 2.0453853607177734 (2.0651548769680894)
| > loss_feat: 7.6552958488464355 (8.288802446972165)
| > loss_mel: 22.356651306152344 (22.661884306189645)
| > loss_duration: 1.5326560735702515 (1.469221028108329)
| > amp_scaler: 128.0 (160.98159509202455)
| > loss_1: 36.10288619995117 (37.10824498536824)
| > grad_norm_1: tensor(128.5965, device='cuda:0') (tensor(290.4952, device='cuda:0'))
| > current_lr_0: 0.0001997002061640866
| > current_lr_1: 0.0001997002061640866
| > step_time: 3.5176 (2.388011707745551)
| > loader_time: 0.021 (0.007880230309355165)
Evaluation:
...
....
--> STEP: 21
| > loss_disc: 2.4334442615509033 (2.429446754001436)
| > loss_disc_real_0: 0.12144245952367783 (0.13465816066378644)
| > loss_disc_real_1: 0.16504763066768646 (0.20793592716966355)
| > loss_disc_real_2: 0.2902933359146118 (0.2909027777966999)
| > loss_disc_real_3: 0.25102439522743225 (0.24391093992051624)
| > loss_disc_real_4: 0.3473176658153534 (0.24767487815448216)
| > loss_disc_real_5: 0.28220587968826294 (0.26010643371513914)
| > loss_0: 2.4334442615509033 (2.429446754001436)
| > loss_gen: 2.770388603210449 (2.59090789159139)
| > loss_kl: 1.5279167890548706 (2.3010178974696567)
| > loss_feat: 9.003446578979492 (7.517528613408406)
| > loss_mel: 25.329435348510742 (22.98664746965681)
| > loss_duration: 1.6732535362243652 (1.4985651345480056)
| > loss_1: 40.304439544677734 (36.894666853405184)
--> STEP: 22
| > loss_disc: 2.23518705368042 (2.420616767623208)
| > loss_disc_real_0: 0.07352343201637268 (0.13187930936163125)
| > loss_disc_real_1: 0.22937196493148804 (0.20891029252247376)
| > loss_disc_real_2: 0.3216579854488373 (0.29230074178088794)
| > loss_disc_real_3: 0.22647826373577118 (0.24311854554848236)
| > loss_disc_real_4: 0.3893234431743622 (0.25411344929174945)
| > loss_disc_real_5: 0.24885831773281097 (0.25959515571594244)
| > loss_0: 2.23518705368042 (2.420616767623208)
| > loss_gen: 3.2849197387695312 (2.622453884644942)
| > loss_kl: 1.5750868320465088 (2.2680210308595137)
| > loss_feat: 10.77985668182373 (7.6658162528818306)
| > loss_mel: 24.645959854125977 (23.062070759859953)
| > loss_duration: 1.6656222343444824 (1.506158639084209)
| > loss_1: 41.951446533203125 (37.12452047521418)
| > Synthesizing test sentences.
Input text cannot be None
! Run is kept in /home/dev/workspace/TTS/recipes/ljspeech/multi_speaker/speaker_multispeaker_fromscratch-January-17-2025_08+17PM-dbf1a08a
Traceback (most recent call last):
File "/home/dev/anaconda3/envs/tts2/lib/python3.9/site-packages/trainer/trainer.py", line 1833, in fit
self._fit()
File "/home/dev/anaconda3/envs/tts2/lib/python3.9/site-packages/trainer/trainer.py", line 1789, in _fit
self.test_run()
File "/home/dev/anaconda3/envs/tts2/lib/python3.9/site-packages/trainer/trainer.py", line 1698, in test_run
test_outputs = self.model.test_run(self.training_assets)
File "/home/dev/anaconda3/envs/tts2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/dev/workspace/TTS/TTS/tts/models/vits.py", line 1442, in test_run
wav, alignment, _, _ = synthesis(
File "/home/dev/workspace/TTS/TTS/tts/utils/synthesis.py", line 221, in synthesis
outputs = run_model_torch(
File "/home/dev/workspace/TTS/TTS/tts/utils/synthesis.py", line 53, in run_model_torch
outputs = _func(
File "/home/dev/anaconda3/envs/tts2/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/dev/workspace/TTS/TTS/tts/models/vits.py", line 1150, in inference
attn = generate_path(w_ceil.squeeze(1), attn_mask.squeeze(1).transpose(1, 2))
IndexError: Dimension out of range (expected to be in range of [-2, 1], but got 2)
```
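For reference, "expected to be in range of [-2, 1]" means the tensor reaching `transpose(1, 2)` at `vits.py:1150` has only two dimensions, one axis fewer than the usual `[batch, 1, T_spec, T_text]` attention mask. A standalone sketch of the same failure follows; the shapes are illustrative only and not taken from this run:

```python
import torch

# Expected case: attn_mask shaped [batch, 1, T_spec, T_text]
attn_mask = torch.ones(1, 1, 50, 12, dtype=torch.bool)
print(attn_mask.squeeze(1).transpose(1, 2).shape)  # torch.Size([1, 12, 50]) -> works

# If an axis is lost upstream (plausibly because the test sentence is empty/None,
# as the "Input text cannot be None" warning suggests), the same call fails.
# The shape below is illustrative, not taken from the actual model.
attn_mask = torch.ones(1, 1, 12, dtype=torch.bool)
try:
    attn_mask.squeeze(1).transpose(1, 2)
except IndexError as exc:
    print(exc)  # Dimension out of range (expected to be in range of [-2, 1], but got 2)
```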
To Reproduce
Run the multi-speaker VITS recipe with 14 speakers.
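Since the crash happens while synthesizing test sentences, the `test_sentences` entries in the recipe config are worth checking first. A minimal sketch of the relevant part of a multi-speaker VITS config follows; the names and sentences are illustrative, not the reporter's actual recipe, and the exact per-entry format should be verified against the fork's recipes:

```python
# Illustrative only - not the reporter's recipe. An empty or None entry in
# `test_sentences` is a plausible trigger for the "Input text cannot be None"
# warning printed just before the IndexError.
from TTS.tts.configs.vits_config import VitsConfig
from TTS.tts.models.vits import VitsArgs

vits_args = VitsArgs(use_speaker_embedding=True)  # multi-speaker via learned speaker embeddings

config = VitsConfig(
    model_args=vits_args,
    run_name="vits_multispeaker_14spk",  # hypothetical run name
    test_sentences=[
        # each entry is a non-empty list: [text] or [text, speaker_name, ...]
        ["It took me quite a long time to develop a voice."],
        ["Be a voice, not an echo.", "speaker_01"],
    ],
)
```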
Expected behavior
No response
Logs
Environment
Additional context
No response