In the paper, h_bert is described as the output of the Prosodic Text Encoder module. In the code, however, I believe h_bert corresponds to the variable d_en, which is not the variable that gets fed into the other components.

In the implementation, bert_dur plays the role of h_bert from the paper, but as stated above, h_bert should be the Prosodic Text Encoder's output, which I believe is d_en in the implementation.

train_second.py:

```python
bert_dur = model.bert(texts, attention_mask=(~text_mask).int())
d_en = model.bert_encoder(bert_dur).transpose(-1, -2)
```

Yet further down, the diffusion sampler and the EDM loss are conditioned on bert_dur, not d_en:

```python
s_preds = sampler(noise=torch.randn_like(s_trg).unsqueeze(1).to(device),
                  embedding=bert_dur,
                  embedding_scale=1,
                  features=ref,  # reference from the same speaker as the embedding
                  embedding_mask_proba=0.1,
                  num_steps=num_steps).squeeze(1)
loss_diff = model.diffusion(s_trg.unsqueeze(1), embedding=bert_dur, features=ref).mean()  # EDM loss
loss_sty = F.l1_loss(s_preds, s_trg.detach())  # style reconstruction loss
```

Am I missing something?
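To make the mismatch concrete, here is a minimal NumPy sketch of the two tensors involved. The dimensions (T, bert_dim, enc_dim) are made-up stand-ins, and model.bert_encoder is reduced to a single linear projection; this is only meant to illustrate that bert_dur (the raw PLBERT hidden states) and d_en (the encoder output, transposed) are different tensors with different shapes, and that the sampler call above receives the former.

```python
import numpy as np

# Hypothetical shapes, assumed for illustration (not taken from the repo's config).
T, bert_dim, enc_dim = 50, 768, 512

rng = np.random.default_rng(0)

# bert_dur: stand-in for the raw PLBERT hidden states, shape (T, bert_dim).
bert_dur = rng.standard_normal((T, bert_dim))

# A single linear projection standing in for model.bert_encoder.
W = rng.standard_normal((bert_dim, enc_dim))

# d_en: projected then transposed, matching .transpose(-1, -2) in the code.
d_en = (bert_dur @ W).T

print(bert_dur.shape)  # (50, 768) -> this is what the sampler gets as `embedding`
print(d_en.shape)      # (512, 50) -> this is what the other components consume
```

The point of the sketch is simply that conditioning the diffusion on bert_dur instead of d_en skips the encoder projection entirely, which is the discrepancy with the paper's description of h_bert.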
I can't answer your question, but I do have an insight.

The paper's author himself said that the paper is essentially what this prototype was based on, and that the prototype was then iterated upon until things worked as they should. Not everything from the paper will match this repository, and it's possible that this part of the code is the result of such refactoring.