
dataloader and duration for inference #716

Open
yl4579 opened this issue Jan 13, 2025 · 3 comments
Labels: question (Further information is requested)

Comments


yl4579 commented Jan 13, 2025

Checks

  • This template is only for questions, not feature requests or bug reports.
  • I have thoroughly reviewed the project documentation and read the related paper(s).
  • I have searched existing issues, including closed ones, and found no similar questions.
  • I confirm that I am using English to submit this report to facilitate communication.

Question details

In #653, I encountered a problem with generalizing to longer audio when training from scratch. Upon further investigation, I found the problem may occur because of the dataloader. More specifically, when the dataloader type is "frame", each batch has the same total duration, and the only thing that varies is the number of samples. For example, if the sampled clips are 5 seconds long, the batch could have 60 samples, while if the audio is 30 seconds long, it could contain only 10 samples.
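
To make the asymmetry concrete, here is a minimal sketch of what I understand the "frame"-type batching to do (illustrative only; `frame_type_batches` and `max_frames` are my names, not the repo's actual sampler):

```python
import random

def frame_type_batches(durations, max_frames):
    """Group sample indices so each batch holds at most `max_frames` total
    frames; batches therefore contain fewer samples when the clips are longer."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches, cur, cur_frames = [], [], 0
    for i in order:
        if cur and cur_frames + durations[i] > max_frames:
            batches.append(cur)
            cur, cur_frames = [], 0
        cur.append(i)
        cur_frames += durations[i]
    if cur:
        batches.append(cur)
    random.shuffle(batches)  # shuffle batch order, keep the per-batch frame budget
    return batches

# With a 300 s frame budget: 5 s clips give ~60 samples per batch,
# while 30 s clips give only ~10 samples per batch.
```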

This means that the model sees fewer samples per batch for longer audio. Longer audio is inherently hard to learn, and we are providing fewer samples in a batch, making it even harder to estimate the score (or flow) accurately.

Is this the reason why you chunk the input text during inference? Have you tested changing the dataloader type to "sample" so each batch contains samples of varied lengths? What's the difference here?


SWivid commented Jan 14, 2025

The issue in #653 is different from your description here.
In #653, the problem is that you cannot do inference at lengths unseen in training.

Is this the reason why you chunk the input text during inference?

Chunking is just to make sure the total length is < 30 s, because the Emilia training set has audio samples up to 30 seconds long.
We can certainly do inference at a total length of 30 s if 30 s was seen in training.
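
Roughly, the chunking amounts to something like this sketch (illustrative only, not the actual inference code; it estimates the speaking rate from the reference audio to budget characters per chunk):

```python
def chunk_text(gen_text, ref_text, ref_audio_sec, max_total_sec=30.0):
    """Split `gen_text` into chunks whose estimated duration, added to the
    reference audio, stays under `max_total_sec` (the longest length seen in training)."""
    sec_per_char = ref_audio_sec / max(len(ref_text), 1)        # speaking rate of the prompt
    char_budget = int((max_total_sec - ref_audio_sec) / sec_per_char)
    chunks, current = [], ""
    for sentence in gen_text.replace("!", ".").replace("?", ".").split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        if current and len(current) + len(sentence) + 2 > char_budget:
            chunks.append(current.strip())
            current = ""
        current += sentence + ". "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```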

Have you tested changing the dataloader type to "sample" so each batch contains samples of varied lengths? What's the difference here?

Of course you can have samples of different lengths in one batch (you need to rewrite the sampler) and apply a proper padding mask, otherwise attention goes to many padding tokens.
And training will take more time because of the extra padding and the additional mask computation (you would also need a customized flash_attn rather than the native PyTorch one).
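
Something like this minimal sketch (assumed names, native PyTorch attention rather than flash_attn) shows the padding and mask that would be needed:

```python
import torch
import torch.nn.functional as F

def pad_and_mask(seqs):
    """Pad variable-length [T_i, D] sequences to [B, T_max, D] and build a
    boolean mask that is True on real frames and False on padding."""
    lengths = torch.tensor([s.shape[0] for s in seqs])
    x = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True)   # [B, T_max, D]
    mask = torch.arange(x.shape[1])[None, :] < lengths[:, None]   # [B, T_max]
    return x, mask

def masked_self_attention(q, k, v, mask):
    """Self-attention that ignores padded key positions.
    q, k, v: [B, H, T, Dh]; mask: [B, T] with True on real frames."""
    attn_mask = mask[:, None, None, :]  # broadcast over heads and query positions
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```

A variable-length flash-attention kernel would avoid computing over the padding at all, which is the customization mentioned above.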

Longer audio is inherently hard to learn, and we are providing fewer samples in a batch, making it even harder to estimate the score (or flow) accurately.

The total amount of training data is the same, and the loss is computed over the batch (which has the same total duration/frames), not per sample.
So I don't quite get your point: why is it harder to learn here?
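
What I mean by the loss being counted over the batch's frames, roughly (an illustrative sketch, not the exact training code):

```python
import torch
import torch.nn.functional as F

def frame_level_loss(pred, target, mask):
    """Mean regression loss over all real (unmasked) frames in the batch, so
    every frame contributes equally regardless of which sample it belongs to."""
    per_frame = F.mse_loss(pred, target, reduction="none").mean(dim=-1)  # [B, T]
    return (per_frame * mask).sum() / mask.sum()
```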


yl4579 commented Jan 15, 2025


I was asking this question because I was trying to make the model sample with fewer steps, and that does not seem to work well for longer audio: it causes alignment issues between text and speech. With many diffusion steps (>32) longer audio works, but with few diffusion steps (<8) longer audio fails.
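
By "sample with fewer steps" I mean cutting the number of function evaluations in the ODE integration, along the lines of this sketch (a plain Euler integrator with a uniform schedule; the `model` signature is illustrative):

```python
import torch

@torch.no_grad()
def euler_sample(model, x0, cond, nfe=8):
    """Integrate dx/dt = v_theta(x, t, cond) from t=0 (noise) to t=1 (speech)
    with `nfe` uniform Euler steps; fewer steps means a coarser trajectory."""
    x, ts = x0, torch.linspace(0.0, 1.0, nfe + 1)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v = model(x, t0.expand(x.shape[0]), cond)  # predicted velocity at time t0
        x = x + (t1 - t0) * v
    return x
```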

This could be because the model doesn't use cross-attention; instead it infuses the text input using concatenation and a simple convolution. Intuitively, at the first layer the text tokens and audio are fused, and the model has to attend to the first N frames (N being the length of the text) to infer the text information. But as the layers get deeper, this text information is diluted because the first N frames become more and more "speech-like". This is different from cross-attention, where the text information is injected at each layer and there is no mixing between speech and text at later layers.
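
To make the contrast concrete, here is a schematic sketch of the two conditioning styles I mean, input-level fusion versus per-layer cross-attention injection (module names and shapes are illustrative, not the actual F5-TTS code):

```python
import torch
import torch.nn as nn

class InputFusionBackbone(nn.Module):
    """Text is fused with audio once at the input (here a simple convolution),
    then plain self-attention blocks follow; the text information can get
    diluted as the depth increases."""
    def __init__(self, dim, depth):
        super().__init__()
        self.fuse = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth))

    def forward(self, audio, text):  # both [B, T, dim], text padded/aligned to T
        x = torch.cat([audio, text], dim=-1).transpose(1, 2)  # [B, 2*dim, T]
        x = self.fuse(x).transpose(1, 2)                      # fused once, [B, T, dim]
        for blk in self.blocks:
            x = blk(x)
        return x

class CrossAttnBackbone(nn.Module):
    """Text is re-injected at every layer through cross-attention, so deep
    layers still see the undiluted text tokens."""
    def __init__(self, dim, depth):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True) for _ in range(depth))

    def forward(self, audio, text):  # audio [B, T, dim], text [B, N, dim]
        x = audio
        for blk in self.blocks:
            x = blk(x, memory=text)  # self-attention on audio + cross-attention to text
        return x
```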

I saw there is a class called "MMDiT" that uses cross-attention but is never used. You mentioned in #96 that MMDiT results in failure, and I tested it myself and it does fail (text-speech alignment breaks and it produces gibberish). But interestingly, if I change the dataloader from "frame" to "sample" with input masks, it starts to work fine.

What is the reason for this? Have you investigated the difference between cross-attention and ConvNeXt, and between the "sample" and "frame" dataloaders? The same thing happened when I was working on a DiTTo-TTS-like model: when no input mask was provided, the model produced gibberish, but with an input mask it worked fine, and this only happened when cross-attention was used.

I'd like to talk more about this for a new project if you are interested. Could you share your email or other contact information so we can chat more?


SWivid commented Jan 16, 2025

Yes, for sure; my email is in https://arxiv.org/pdf/2410.06885.
We could exchange contact info there, e.g. WeChat if you use it?

  1. MMDiT could work to some extent: it learns alignment really fast (but with many duplicated utterances and hallucinations), yet it also collapses quickly (timbre and content) in our experiments. So I personally think a double-then-single-stream transformer (FLUX-like) may be better; see the rough sketch after this list.
  2. Cross-attention needs more computation and adds model size at the same depth, and we are not sure whether a simpler approach than DiTTo-TTS (LM + LM-infused codec) can achieve good results.
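
A very rough sketch of what I mean by a FLUX-like double-then-single-stream arrangement (all names are illustrative, and this glosses over many details of the real architecture):

```python
import torch
import torch.nn as nn

class DoubleThenSingleStream(nn.Module):
    """Double-stream blocks keep text and audio in separate streams (separate
    weights, joint attention over both); the streams are then merged and
    processed by shared single-stream blocks."""
    def __init__(self, dim, n_double, n_single, nhead=8):
        super().__init__()
        self.text_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True) for _ in range(n_double))
        self.audio_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True) for _ in range(n_double))
        self.joint_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, nhead, batch_first=True) for _ in range(n_double))
        self.single_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead, batch_first=True) for _ in range(n_single))

    def forward(self, text, audio):  # text [B, N, dim], audio [B, T, dim]
        n = text.shape[1]
        for t_blk, a_blk, attn in zip(self.text_blocks, self.audio_blocks, self.joint_attn):
            joint = torch.cat([text, audio], dim=1)   # joint attention over both streams
            mixed, _ = attn(joint, joint, joint)
            text = t_blk(text + mixed[:, :n])         # per-stream transformer weights
            audio = a_blk(audio + mixed[:, n:])
        x = torch.cat([text, audio], dim=1)           # merge into a single stream
        for blk in self.single_blocks:
            x = blk(x)
        return x[:, n:]                               # keep the audio positions
```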

MMDiT ... But interestingly, if I change the dataloader from "frame" to "sample" with input masks, it starts to work fine.

Yes, it's very interesting.
If we could get acceptable results with the "sample" setting for MMDiT (e.g., would it converge enough to get rid of the heavy duplication?), this structure could be used, since it is more training- and inference-efficient.
Have you tried an input mask with the "frame" setting? Did that fail for MMDiT?


Yep, we could discuss further. I also sent you an email (through your campus address?).
