Is the ~15 Second Clipping of Logged Audio Samples During Training Intentional? #719

Open
4 tasks done
hcsolakoglu opened this issue Jan 14, 2025 · 2 comments
Labels: question (Further information is requested)

Comments

@hcsolakoglu
Contributor

Checks

  • This template is only for questions, not feature requests or bug reports.
  • I have thoroughly reviewed the project documentation and read the related paper(s).
  • I have searched for existing issues, including closed ones, and found no similar questions.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Question details

When training with the log_samples option enabled, generated audio samples are consistently clipped to approximately 15 seconds, even if the original reference audio is longer (e.g., 29 seconds). This happens during the training process when samples are generated for logging purposes. Is this clipping behavior intentional, perhaps to limit resource usage during logging, or is it a bug? Currently, because of this behavior, using a training dataset with clips close to the 30-second limit (but longer than 15 seconds) is problematic, as the generated samples during training are of very poor quality.

hcsolakoglu added the question label on Jan 14, 2025
@SWivid
Owner

SWivid commented Jan 14, 2025

Thanks for reporting this.

if self.log_samples and self.accelerator.is_local_main_process:
    ref_audio_len = mel_lengths[0]
    infer_text = [
        text_inputs[0] + ([" "] if isinstance(text_inputs[0], list) else " ") + text_inputs[0]
    ]
    with torch.inference_mode():
        generated, _ = self.accelerator.unwrap_model(self.model).sample(
            cond=mel_spec[0][:ref_audio_len].unsqueeze(0),
            text=infer_text,
            duration=ref_audio_len * 2,
            steps=nfe_step,
            cfg_strength=cfg_strength,
            sway_sampling_coef=sway_sampling_coef,
        )
        generated = generated.to(torch.float32)
        gen_mel_spec = generated[:, ref_audio_len:, :].permute(0, 2, 1).to(self.accelerator.device)
        ref_mel_spec = batch["mel"][0].unsqueeze(0)
        if self.vocoder_name == "vocos":
            gen_audio = vocoder.decode(gen_mel_spec).cpu()
            ref_audio = vocoder.decode(ref_mel_spec).cpu()
        elif self.vocoder_name == "bigvgan":
            gen_audio = vocoder(gen_mel_spec).squeeze(0).cpu()
            ref_audio = vocoder(ref_mel_spec).squeeze(0).cpu()

Yes, if we just picked a long sample, it's actually doing inference with double the length (which surely exceeds the max length seen by the model).
Possibly we could:

  1. always clip the first 3 seconds as ref_mel (could cut in the middle of a word; see the sketch below)
  2. pre-select a fixed sample, and always use it for logging
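
A minimal sketch of option 1, assuming the default 24 kHz / 256-hop mel setup used in this thread (about 94 frames per second); the helper name clip_ref_for_logging is hypothetical, not existing trainer code:

def clip_ref_for_logging(mel_spec_item, mel_length, hop_length=256, sample_rate=24000, ref_seconds=3.0):
    # Take only the first ~3 seconds of mel frames as the logging reference,
    # so that duration = ref_audio_len * 2 stays far below the model's max length.
    frames_per_sec = sample_rate / hop_length
    ref_audio_len = min(int(ref_seconds * frames_per_sec), int(mel_length))
    return mel_spec_item[:ref_audio_len], ref_audio_len

# Sketch of usage inside the logging branch above (not the actual trainer code):
# ref_mel, ref_audio_len = clip_ref_for_logging(mel_spec[0], mel_lengths[0])
# ... .sample(cond=ref_mel.unsqueeze(0), text=infer_text, duration=ref_audio_len * 2, ...)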

@hcsolakoglu
Contributor Author

hcsolakoglu commented Jan 17, 2025

Doesn't the problem arise from this? As you said, multiple solutions are possible. We could select from a pool of training samples under 15 seconds (not a single sample, because when training on multiple languages, e.g. English and Chinese, picking only an English sample means we couldn't track progress in Chinese). Alternatively, we could let the user provide reference audio and text in advance, similar to a validation set. If multiple references are allowed, the user could track both in-distribution improvements (within the training data) and generalization beyond it using external references.
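
A rough sketch of the pool idea, assuming each dataset item exposes a precomputed mel with frames on the last axis and the 256-hop / 24 kHz config discussed here; build_logging_pool is a hypothetical helper, not an existing option:

import random

def build_logging_pool(dataset, max_ref_seconds=15.0, hop_length=256, sample_rate=24000, pool_size=4):
    # Pre-select a few fixed samples shorter than ~15 s so that duration = ref_len * 2
    # stays below max_duration (4096 frames) and below the lengths seen during training.
    max_frames = int(max_ref_seconds * sample_rate / hop_length)
    short_ids = [i for i in range(len(dataset)) if dataset[i]["mel"].shape[-1] <= max_frames]
    rng = random.Random(0)  # fixed seed so the same samples are logged on every run
    return rng.sample(short_ids, k=min(pool_size, len(short_ids)))

One pool entry per language (or a user-provided reference list) could then be cycled through at each logging step.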

@SWivid

max_duration=4096,

def sample(
    self,
    cond: float["b n d"] | float["b nw"],  # noqa: F722
    text: int["b nt"] | list[str],  # noqa: F722
    duration: int | int["b"],  # noqa: F821
    *,
    lens: int["b"] | None = None,  # noqa: F821
    steps=32,
    cfg_strength=1.0,
    sway_sampling_coef=None,
    seed: int | None = None,
    max_duration=4096,
    vocoder: Callable[[float["b d n"]], float["b nw"]] | None = None,  # noqa: F722
    no_ref_audio=False,
    duplicate_test=False,
    t_inter=0.1,
    edit_mask=None,
):

max_duration = duration.amax()

When log_samples is enabled, the duration passed to CFM.sample is calculated as ref_audio_len * 2.
If ref_audio_len * 2 exceeds max_duration (4096), the total sequence is clamped to 4096 mel frames, so the generated part is only 4096 - ref_audio_len frames long.

seconds = (4096 * 256) / 24000 ≈ 43.69

For a ~29-second reference, that leaves only about 14.7 seconds of generated frames, which lines up with the ~15-second clipping observed.
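
A quick sanity check of these numbers, using the hop length and sample rate above and the ~29-second reference from the original report:

HOP_LENGTH = 256       # mel hop length used above
SAMPLE_RATE = 24000    # output sample rate used above
MAX_DURATION = 4096    # frame cap applied in CFM.sample

frames_per_sec = SAMPLE_RATE / HOP_LENGTH      # 93.75 frames per second
ref_frames = int(29 * frames_per_sec)          # ~2718 frames for a 29 s reference
requested = ref_frames * 2                     # ~5436 frames, exceeds the cap
generated_frames = min(requested, MAX_DURATION) - ref_frames
print(generated_frames / frames_per_sec)       # ~14.7 s -> the observed ~15 s clip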

By the way, my training samples range from 1 second to 30 seconds, and I believe nothing prevents the model from learning from these samples, correct? Only sample generation (via the sampler) is currently affected?
