dataloader and duration for inference #716
The issue in #653 is different from what you describe here.
Chunking is just to make sure the total length is under 30 s, because the Emilia training set has audio samples of up to 30 seconds.
Of course you could have different-length samples in one batch (you would need to rewrite the sampler) and apply a proper padding mask; otherwise attention attends to too many padding tokens.
The total amount of training data is the same, and the loss is computed per batch (same total duration/frames inside), not per sample.
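For anyone reading along, here is a minimal sketch of what "different-length samples in one batch plus a proper padding mask" could look like, i.e. padding variable-length mel sequences in a collate step and building a mask so attention and the loss ignore the padded frames. The function name and tensor shapes are assumptions for illustration, not the repository's actual collate code.

```python
import torch

def pad_and_mask(samples, pad_value=0.0):
    """Pad variable-length mel tensors of shape (frames, dim) to the longest
    one in the batch and return a boolean mask with True at real frames."""
    lengths = torch.tensor([s.shape[0] for s in samples])
    max_len, dim = int(lengths.max()), samples[0].shape[1]
    batch = torch.full((len(samples), max_len, dim), pad_value)
    for i, s in enumerate(samples):
        batch[i, : s.shape[0]] = s
    mask = torch.arange(max_len)[None, :] < lengths[:, None]   # (B, T) bool
    return batch, mask

# The mask can then gate both attention (as a key padding mask) and the loss,
# e.g. loss = (per_frame_loss * mask).sum() / mask.sum()
```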
I was asking this question because I was trying to make this model sample with fewer steps, and it doesn't seem to work well for longer audio: it causes alignment issues between text and speech. With many diffusion steps (>32) longer audio works, but with few steps (<8) longer audio fails. This could be because the model doesn't use cross attention; instead it infuses the text input via concatenation and a simple convolution. Intuitively, in the first layer the text tokens and audio are fused, and the model has to attend to the first N frames (N being the length of the text) to infer the text information, but as the layers get deeper this text information is diluted because the first N frames become more and more "speech-like". This is different from cross attention, where the text information is injected at every layer and there is no mixing between speech and text at later layers.

I saw there is a class called "MMDiT" which uses cross attention, but it is never used. You mentioned in #96 that MMDiT results in failure, and I tested it myself: it does fail (text-speech alignment breaks down and it produces gibberish). Interestingly, though, if I change the dataloader from "frame" to "sample" with input masks, it starts to work fine. What is the reason for this? Have you investigated the difference between cross attention vs. ConvNeXt, and between the "sample" and "frame" dataloaders? The same thing happened when I was working on a DiTToTTS-like model: with no input mask the model produced gibberish, but with an input mask it worked fine, and this only happened when cross attention was used.

I'd like to talk more about this for a new project if you are interested. Is it possible to have your email or other contact information so we can chat more?
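To make the contrast above concrete, here is a rough, hedged sketch of the two conditioning styles; it is a generic illustration only, not the actual F5-TTS, DiTToTTS, or MMDiT modules, and it assumes the text embeddings have already been padded or expanded to the audio length for the concatenation case.

```python
import torch
import torch.nn as nn

class ConcatConditioning(nn.Module):
    """Fuse text once at the input: concatenate text embeddings with the
    audio frames along the channel dimension and mix them with a convolution.
    Deeper layers only see the fused stream, so the text information can be
    progressively diluted as the frames become more speech-like."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, audio, text):            # both (B, T, D)
        x = torch.cat([audio, text], dim=-1)   # (B, T, 2D)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)

class CrossAttnConditioning(nn.Module):
    """Re-inject text at every layer: the audio stream queries the text
    stream, so speech and text are never merged into a single sequence."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, text, text_pad_mask=None):  # (B, T_a, D), (B, T_t, D)
        out, _ = self.attn(audio, text, text,
                           key_padding_mask=text_pad_mask)  # True marks padded keys
        return audio + out
```

The `key_padding_mask` argument is also where the "input mask" mentioned above would enter: without it, cross attention freely attends to padded text positions.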
Yes, for sure. My email is in https://arxiv.org/pdf/2410.06885
Yes, it's very interesting. We could discuss further; I also sent you an email (through your campus address?).
Question details
In #653, I encountered a problem generalizing to longer audio when training from scratch. Upon further investigation, I found the problem may stem from the dataloader. More specifically, when the dataloader type is "frame", each batch has the same total duration, and the only thing that varies is the number of samples. For example, if the sampled audio is 5 seconds long, the batch could have 60 samples, while if the audio is 30 seconds long, it could contain only 10 samples.
This means that the model sees fewer samples per batch for longer audios. Longer audio is inherently hard to learn, and we are providing fewer samples in a batch, making it even harder to estimate the score (or flow) accurately.
Is this the reason why you chunk the input text during inference? Have you tested changing the dataloader type to "sample" so that each batch contains samples of varied lengths? What's the difference here?
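For concreteness, here is a rough sketch of the two batching strategies being asked about; these are hypothetical helpers for illustration, not the repository's actual sampler, with `durations` assumed to be a list of per-sample frame counts.

```python
import random

def frame_type_batches(durations, max_frames):
    """Pack sample indices so each batch holds roughly the same total number
    of frames; batches of long samples therefore contain fewer samples."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches, cur, cur_frames = [], [], 0
    for i in order:
        if cur and cur_frames + durations[i] > max_frames:
            batches.append(cur)
            cur, cur_frames = [], 0
        cur.append(i)
        cur_frames += durations[i]
    if cur:
        batches.append(cur)
    random.shuffle(batches)
    return batches

def sample_type_batches(durations, batch_size):
    """Fixed number of samples per batch regardless of length; the collate
    step must pad to the longest sample and provide an attention mask."""
    order = list(range(len(durations)))
    random.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

With the frame-type packing, a batch of 30-second samples ends up with far fewer samples than a batch of 5-second ones, which is exactly the effect described above.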