
dataloader and duration for inference #716

Open
yl4579 opened this issue Jan 13, 2025 · 3 comments
Labels: question (Further information is requested)

Comments


yl4579 commented Jan 13, 2025

Checks

  • This template is only for questions, not feature requests or bug reports.
  • I have thoroughly reviewed the project documentation and read the related paper(s).
  • I have searched existing issues, including closed ones, and found no similar questions.
  • I confirm that I am using English to submit this report to facilitate communication.

Question details

In #653, I encountered a problem with generalizing to longer audio when training from scratch. Upon further investigation, I found the problem may occur because of the dataloader. More specifically, when the dataloader type is "frame", each batch has the same total duration, and the only thing that varies is the number of samples. For example, if the sampled clips are 5 seconds long, the batch could have 60 samples, while if the audio is 30 seconds long, it could contain only 10 samples.
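
To make the asymmetry concrete, here is a minimal sketch of what I understand the "frame"-type batching to do (illustrative only; `frame_type_batches` and `max_frames` are my names, not the repo's actual sampler):

```python
import random

def frame_type_batches(durations, max_frames):
    """Group sample indices so each batch holds at most `max_frames` total
    frames; batches therefore contain fewer samples when the clips are longer."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches, cur, cur_frames = [], [], 0
    for i in order:
        if cur and cur_frames + durations[i] > max_frames:
            batches.append(cur)
            cur, cur_frames = [], 0
        cur.append(i)
        cur_frames += durations[i]
    if cur:
        batches.append(cur)
    random.shuffle(batches)  # shuffle batch order, keep the per-batch frame budget
    return batches

# With a 300 s frame budget: 5 s clips give ~60 samples per batch,
# while 30 s clips give only ~10 samples per batch.
```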

This means that the model sees fewer samples per batch for longer audio. Longer audio is inherently hard to learn, and we are providing fewer samples in a batch, making it even harder to estimate the score (or flow) accurately.

Is this the reason why you chunk the input text during inference? Have you tested changing the dataloader type to "sample" so each batch contains samples of varied lengths? What's the difference here?


SWivid commented Jan 14, 2025

The issue in #653 is different from your description here.
In #653, the problem is that you cannot do inference at lengths unseen in training.

Is this the reason why you chunk the input text during inference?

Chunking is just to make sure the total length is < 30 s, because the Emilia training set has audio samples up to 30 seconds long.
We can certainly do inference at a total length of 30 s if 30 s was seen in training.
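
Roughly, the chunking amounts to something like this sketch (illustrative only, not the actual inference code; it estimates the speaking rate from the reference audio to budget characters per chunk):

```python
def chunk_text(gen_text, ref_text, ref_audio_sec, max_total_sec=30.0):
    """Split `gen_text` into chunks whose estimated duration, added to the
    reference audio, stays under `max_total_sec` (the longest length seen in training)."""
    sec_per_char = ref_audio_sec / max(len(ref_text), 1)        # speaking rate of the prompt
    char_budget = int((max_total_sec - ref_audio_sec) / sec_per_char)
    chunks, current = [], ""
    for sentence in gen_text.replace("!", ".").replace("?", ".").split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        if current and len(current) + len(sentence) + 2 > char_budget:
            chunks.append(current.strip())
            current = ""
        current += sentence + ". "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```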

Have you tested changing the dataloader type to "sample" so each batch contains samples of varied lengths? What's the difference here?

Of course you can have samples of different lengths in one batch (you need to rewrite the sampler) and apply a proper padding mask, otherwise attention goes to many padding tokens.
And training will take more time because of the extra padding and the additional mask computation (you would also need a customized flash_attn rather than the native PyTorch one).
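
Something like this minimal sketch (assumed names, native PyTorch attention rather than flash_attn) shows the padding and mask that would be needed:

```python
import torch
import torch.nn.functional as F

def pad_and_mask(seqs):
    """Pad variable-length [T_i, D] sequences to [B, T_max, D] and build a
    boolean mask that is True on real frames and False on padding."""
    lengths = torch.tensor([s.shape[0] for s in seqs])
    x = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)   # [B, T_max, D]
    mask = torch.arange(x.shape[1])[None, :] < lengths[:, None]   # [B, T_max]
    return x, mask

def masked_self_attention(q, k, v, mask):
    """Self-attention that ignores padded key positions.
    q, k, v: [B, H, T, Dh]; mask: [B, T] with True on real frames."""
    attn_mask = mask[:, None, None, :]  # broadcast over heads and query positions
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

A variable-length flash-attention kernel would avoid computing over the padding at all, which is the customization mentioned above.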

Longer audio is inherently hard to learn, and we are providing fewer samples in a batch, making it even harder to estimate the score (or flow) accurately.

The total amount of training data is the same, and the loss is computed over the batch (which has the same total duration/frames), not per sample.
So I don't quite get your point: why is it harder to learn here?
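
What I mean by the loss being counted over the batch's frames, roughly (an illustrative sketch, not the exact training code):

```python
import torch
import torch.nn.functional as F

def frame_level_loss(pred, target, mask):
    """Mean regression loss over all real (unmasked) frames in the batch, so
    every frame contributes equally regardless of which sample it belongs to."""
    per_frame = F.mse_loss(pred, target, reduction="none").mean(dim=-1)  # [B, T]
    return (per_frame * mask).sum() / mask.sum()
```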


yl4579 commented Jan 15, 2025


I was asking this question because I was trying to make the model sample with fewer steps, and that does not seem to work well for longer audio: it causes alignment issues between text and speech. With many diffusion steps (>32) longer audio works, but with few diffusion steps (<8) longer audio fails.
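
By "sample with fewer steps" I mean cutting the number of function evaluations in the ODE integration, along the lines of this sketch (a plain Euler integrator with a uniform schedule; the `model` signature is illustrative):

```python
import torch

@torch.no_grad()
def euler_sample(model, x0, cond, nfe=8):
    """Integrate dx/dt = v_theta(x, t, cond) from t=0 (noise) to t=1 (speech)
    with `nfe` uniform Euler steps; fewer steps means a coarser trajectory."""
    x, ts = x0, torch.linspace(0.0, 1.0, nfe + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v = model(x, t0.expand(x.shape[0]), cond)  # predicted velocity at time t0
        x = x + (t1 - t0) * v
    return x
```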

This could be because the model doesn't use cross-attention; instead it infuses the text input using concatenation and a simple convolution. Intuitively, at the first layer the text tokens and audio are fused, and the model has to attend to the first N frames (N being the length of the text) to infer the text information. But as the layers get deeper, this text information is diluted because the first N frames become more and more "speech-like". This is different from cross-attention, where the text information is injected at each layer and there is no mixing between speech and text at later layers.
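
To make the contrast concrete, here is a schematic sketch of the two conditioning styles I mean, input-level fusion versus per-layer cross-attention injection (module names and shapes are illustrative, not the actual F5-TTS code):

```python
import torch
import torch.nn as nn

class InputFusionBackbone(nn.Module):
    """Text is fused with audio once at the input (here a simple convolution),
    then plain self-attention blocks follow; the text information can get
    diluted as the depth increases."""
    def __init__(self, dim, depth):
        super().__init__()
        self.fuse = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth))

    def forward(self, audio, text):  # both [B, T, dim], text padded/aligned to T
        x = torch.cat([audio, text], dim=-1).transpose(1, 2)  # [B, 2*dim, T]
        x = self.fuse(x).transpose(1, 2)                      # fused once, [B, T, dim]
        for blk in self.blocks:
            x = blk(x)
        return x

class CrossAttnBackbone(nn.Module):
    """Text is re-injected at every layer through cross-attention, so deep
    layers still see the undiluted text tokens."""
    def __init__(self, dim, depth):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth))

    def forward(self, audio, text):  # audio [B, T, dim], text [B, N, dim]
        x = audio
        for blk in self.blocks:
            x = blk(x, memory=text)  # self-attention on audio + cross-attention to text
        return x
```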

I saw there is a class called "MMDiT" that uses cross-attention but is never used. You mentioned in #96 that MMDiT results in failure, and I tested it myself and it does fail (text-speech alignment breaks and it produces gibberish). But interestingly, if I change the dataloader from "frame" to "sample" with input masks, it starts to work fine.

What is the reason for this? Have you investigated the difference between cross-attention and ConvNeXt, and between the "sample" and "frame" dataloaders? The same thing happened when I was working on a DiTTo-TTS-like model: when no input mask was provided, the model produced gibberish, but with an input mask it worked fine, and this only happened when cross-attention was used.

I'd like to talk more about this for a new project if you are interested. Could you share your email or other contact information so we can chat more?


SWivid commented Jan 16, 2025

Yes, for sure; my email is in https://arxiv.org/pdf/2410.06885.
We could exchange contact info there, e.g. WeChat if you use it?

  1. MMDiT could work to some extent: it learns alignment really fast (but with many duplicated utterances and hallucinations), yet it also collapses quickly (timbre and content) in our experiments. So I personally think a double-then-single-stream transformer (FLUX-like) may be better; see the rough sketch after this list.
  2. Cross-attention needs more computation and adds model size at the same depth, and we are not sure whether a simpler approach than DiTTo-TTS (LM + LM-infused codec) can achieve good results.
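
A very rough sketch of what I mean by a FLUX-like double-then-single-stream arrangement (all names are illustrative, and this glosses over many details of the real architecture):

```python
import torch
import torch.nn as nn

class DoubleThenSingleStream(nn.Module):
    """Double-stream blocks keep text and audio in separate streams (separate
    weights, joint attention over both); the streams are then merged and
    processed by shared single-stream blocks."""
    def __init__(self, dim, n_double, n_single, nhead=8):
        super().__init__()
        self.text_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True) for _ in range(n_double))
        self.audio_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True) for _ in range(n_double))
        self.joint_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, nhead, batch_first=True) for _ in range(n_double))
        self.single_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True) for _ in range(n_single))

    def forward(self, text, audio):  # text [B, N, dim], audio [B, T, dim]
        n = text.shape[1]
        for t_blk, a_blk, attn in zip(self.text_blocks, self.audio_blocks, self.joint_attn):
            joint = torch.cat([text, audio], dim=1)   # joint attention over both streams
            mixed, _ = attn(joint, joint, joint)
            text = t_blk(text + mixed[:, :n])         # per-stream transformer weights
            audio = a_blk(audio + mixed[:, n:])
        x = torch.cat([text, audio], dim=1)           # merge into a single stream
        for blk in self.single_blocks:
            x = blk(x)
        return x[:, n:]                               # keep the audio positions
```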

MMDiT ... But interestingly, if I change the dataloader from "frame" to "sample" with input masks, it starts to work fine.

Yes, it's very interesting.
If we could get acceptable results with the "sample" setting for MMDiT (e.g., would it converge enough to get rid of the heavy duplication?), this structure could be used, since it is more training- and inference-efficient.
Have you tried an input mask with the "frame" setting? Did that fail for MMDiT?


Yep, we could discuss further. I also sent you an email (through your campus address?).
