Is the ~15 Second Clipping of Logged Audio Samples During Training Intentional? #719

Open
4 tasks done
hcsolakoglu opened this issue Jan 14, 2025 · 2 comments
Labels: question (Further information is requested)

Comments

@hcsolakoglu
Contributor

Checks

  • This template is only for questions, not feature requests or bug reports.
  • I have thoroughly reviewed the project documentation and read the related paper(s).
  • I have searched for existing issues, including closed ones, and found no similar questions.
  • I confirm that I am using English to submit this report in order to facilitate communication.

Question details

When training with the log_samples option enabled, generated audio samples are consistently clipped to approximately 15 seconds, even if the original reference audio is longer (e.g., 29 seconds). This happens during the training process when samples are generated for logging purposes. Is this clipping behavior intentional, perhaps to limit resource usage during logging, or is it a bug? Currently, because of this behavior, using a training dataset with clips close to the 30-second limit (but longer than 15 seconds) is problematic, as the generated samples during training are of very poor quality.

hcsolakoglu added the question label on Jan 14, 2025
@SWivid
Owner

SWivid commented Jan 14, 2025

Thanks for reporting this.

if self.log_samples and self.accelerator.is_local_main_process:
    ref_audio_len = mel_lengths[0]
    infer_text = [
        text_inputs[0] + ([" "] if isinstance(text_inputs[0], list) else " ") + text_inputs[0]
    ]
    with torch.inference_mode():
        generated, _ = self.accelerator.unwrap_model(self.model).sample(
            cond=mel_spec[0][:ref_audio_len].unsqueeze(0),
            text=infer_text,
            duration=ref_audio_len * 2,
            steps=nfe_step,
            cfg_strength=cfg_strength,
            sway_sampling_coef=sway_sampling_coef,
        )
        generated = generated.to(torch.float32)
        gen_mel_spec = generated[:, ref_audio_len:, :].permute(0, 2, 1).to(self.accelerator.device)
        ref_mel_spec = batch["mel"][0].unsqueeze(0)
        if self.vocoder_name == "vocos":
            gen_audio = vocoder.decode(gen_mel_spec).cpu()
            ref_audio = vocoder.decode(ref_mel_spec).cpu()
        elif self.vocoder_name == "bigvgan":
            gen_audio = vocoder(gen_mel_spec).squeeze(0).cpu()
            ref_audio = vocoder(ref_mel_spec).squeeze(0).cpu()

Yes, if we just picked a long sample, it's actually doing inference with double the length (which surely exceeds the max length seen by the model).
Possibly we could:

  1. always clip the first 3 seconds as ref_mel (could cut in the middle of a word; see the sketch below)
  2. pre-select a fixed sample, and always use it for logging
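
A minimal sketch of option 1, assuming the default 24 kHz / 256-hop mel setup used in this thread (about 94 frames per second); the helper name clip_ref_for_logging is hypothetical, not existing trainer code:

def clip_ref_for_logging(mel_spec_item, mel_length, hop_length=256, sample_rate=24000, ref_seconds=3.0):
    # Take only the first ~3 seconds of mel frames as the logging reference,
    # so that duration = ref_audio_len * 2 stays far below the model's max length.
    frames_per_sec = sample_rate / hop_length
    ref_audio_len = min(int(ref_seconds * frames_per_sec), int(mel_length))
    return mel_spec_item[:ref_audio_len], ref_audio_len

# Sketch of usage inside the logging branch above (not the actual trainer code):
# ref_mel, ref_audio_len = clip_ref_for_logging(mel_spec[0], mel_lengths[0])
# ... .sample(cond=ref_mel.unsqueeze(0), text=infer_text, duration=ref_audio_len * 2, ...)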

@hcsolakoglu
Contributor Author

hcsolakoglu commented Jan 17, 2025

Doesn't the problem arise from this? As you said, multiple solutions are possible. We could select from a pool of training samples under 15 seconds (not a single sample, because when training on multiple languages, e.g. English and Chinese, picking only an English sample means we couldn't track progress in Chinese). Alternatively, we could let the user provide reference audio and text in advance, similar to a validation set. If multiple references are allowed, the user could track both in-distribution improvements (within the training data) and generalization beyond it using external references.
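
A rough sketch of the pool idea, assuming each dataset item exposes a precomputed mel with frames on the last axis and the 256-hop / 24 kHz config discussed here; build_logging_pool is a hypothetical helper, not an existing option:

import random

def build_logging_pool(dataset, max_ref_seconds=15.0, hop_length=256, sample_rate=24000, pool_size=4):
    # Pre-select a few fixed samples shorter than ~15 s so that duration = ref_len * 2
    # stays below max_duration (4096 frames) and below the lengths seen during training.
    max_frames = int(max_ref_seconds * sample_rate / hop_length)
    short_ids = [i for i in range(len(dataset)) if dataset[i]["mel"].shape[-1] <= max_frames]
    rng = random.Random(0)  # fixed seed so the same samples are logged on every run
    return rng.sample(short_ids, k=min(pool_size, len(short_ids)))

One pool entry per language (or a user-provided reference list) could then be cycled through at each logging step.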

@SWivid

max_duration=4096,

def sample(
    self,
    cond: float["b n d"] | float["b nw"],  # noqa: F722
    text: int["b nt"] | list[str],  # noqa: F722
    duration: int | int["b"],  # noqa: F821
    *,
    lens: int["b"] | None = None,  # noqa: F821
    steps=32,
    cfg_strength=1.0,
    sway_sampling_coef=None,
    seed: int | None = None,
    max_duration=4096,
    vocoder: Callable[[float["b d n"]], float["b nw"]] | None = None,  # noqa: F722
    no_ref_audio=False,
    duplicate_test=False,
    t_inter=0.1,
    edit_mask=None,
):

max_duration = duration.amax()

When log_samples is enabled, the duration passed to CFM.sample is calculated as ref_audio_len * 2.
If ref_audio_len * 2 exceeds max_duration (4096), the total sequence is clamped to 4096 mel frames, so the generated part is only 4096 - ref_audio_len frames long.

seconds = (4096 * 256) / 24000 ≈ 43.69

For a ~29-second reference, that leaves only about 14.7 seconds of generated frames, which lines up with the ~15-second clipping observed.
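
A quick sanity check of these numbers, using the hop length and sample rate above and the ~29-second reference from the original report:

HOP_LENGTH = 256       # mel hop length used above
SAMPLE_RATE = 24000    # output sample rate used above
MAX_DURATION = 4096    # frame cap applied in CFM.sample

frames_per_sec = SAMPLE_RATE / HOP_LENGTH      # 93.75 frames per second
ref_frames = int(29 * frames_per_sec)          # ~2718 frames for a 29 s reference
requested = ref_frames * 2                     # ~5436 frames, exceeds the cap
generated_frames = min(requested, MAX_DURATION) - ref_frames
print(generated_frames / frames_per_sec)       # ~14.7 s -> the observed ~15 s clip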

By the way, my training samples range from 1 second to 30 seconds, and I believe nothing prevents the model from learning from these samples, correct? Only sample generation (via the sampler) is currently affected?
