Is the ~15 Second Clipping of Logged Audio Samples During Training Intentional? #719
Comments
Thanks for reporting this.
F5-TTS/src/f5_tts/model/trainer.py Lines 334 to 356 in f992c4e
Yes, if we just picked a long sample, it would actually be doing inference at double that length, which surely exceeds the max length seen by the model.
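To make that concrete, here is a rough, self-contained sketch (not the actual trainer.py code) of the arithmetic on the sample-logging path; the hop size and sample rate are assumed values and the tensor layout is only illustrative:

```python
import torch

# Dummy mel spectrogram standing in for a near-30 s reference clip picked
# from the batch at logging time.
hop_length, sample_rate = 256, 24000   # assumed mel hop size / audio sample rate
ref_sec = 29.0
ref_mel = torch.zeros(1, 100, int(ref_sec * sample_rate / hop_length))  # (batch, n_mels, frames), illustrative

ref_audio_len = ref_mel.shape[-1]      # reference length in mel frames (~2718)
duration = ref_audio_len * 2           # total frames requested from CFM.sample

print(duration * hop_length / sample_rate)  # ~58 s requested, far beyond the lengths seen in training
```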
Doesn't the problem arise from this? As you said, multiple solutions are possible. We could select from a pool of training clips under 15 seconds (not a single fixed sample, because when training on multiple languages, e.g. English and Chinese, if only an English sample is selected we might not be able to track progress on Chinese). Alternatively, we could let the user provide reference audio and text in advance, similar to a validation set. If multiple references were allowed, the user could track both in-distribution improvements (within the training data) and generalization beyond the training data using external references; a rough sketch of this idea follows below.
F5-TTS/src/f5_tts/model/cfm.py Line 93 in 9e51878
F5-TTS/src/f5_tts/model/cfm.py Line 138 in 9e51878
When log_samples is enabled, the duration passed to CFM.sample is calculated as ref_audio_len * 2, while the cap in the cfm.py lines above works out to seconds = (4096 * 256) / 24000 ≈ 43.7. By the way, my training samples range from 1 s to 30 s, and I believe nothing prevents the model from learning from these samples, correct? Only sample generation is currently affected, via the sampler?
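As a hypothetical illustration of the reference-pool idea above (none of these names exist in the current trainer; the dataset fields and the function are made up for the sketch):

```python
import random

def pick_log_references(dataset, max_ref_sec=15.0, languages=("en", "zh")):
    """Pick one short clip per language to use as the logging reference,
    so progress in every training language can be monitored."""
    refs = {}
    for lang in languages:
        pool = [s for s in dataset if s["lang"] == lang and s["duration"] <= max_ref_sec]
        if pool:
            refs[lang] = random.choice(pool)
    return refs

# The alternative proposal: the user supplies fixed reference audio/text pairs
# up front (like a tiny validation set). Allowing several of them would let
# both in-distribution and out-of-distribution generation be tracked.
```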
Question details
When training with the log_samples option enabled, generated audio samples are consistently clipped to approximately 15 seconds, even if the original reference audio is longer (e.g., 29 seconds). This happens during the training process when samples are generated for logging purposes. Is this clipping behavior intentional, perhaps to limit resource usage during logging, or is it a bug? Currently, because of this behavior, using a training dataset with clips close to the 30-second limit (but longer than 15 seconds) is problematic, as the generated samples during training are of very poor quality.
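A back-of-the-envelope calculation, assuming the sampling cap in CFM.sample is 4096 mel frames (the number used in the (4096 * 256) / 24000 formula in the comments above), reproduces the ~15 second figure for a 29 second reference:

```python
hop_length = 256          # mel hop size
sample_rate = 24000       # audio sample rate
max_frames = 4096         # assumed max_duration in CFM.sample

ref_sec = 29.0
requested_frames = 2 * ref_sec * sample_rate / hop_length   # ref_audio_len * 2 ≈ 5438
clamped_frames = min(requested_frames, max_frames)          # clamped to 4096 (~43.7 s total)

generated_sec = clamped_frames * hop_length / sample_rate - ref_sec
print(round(generated_sec, 1))  # ≈ 14.7 s of generated audio left after the reference
```

If that assumption holds, the ~15 second limit falls out of the fixed frame cap combined with the doubled reference length.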