Determine vae model convergence #18
Comments
Additionally, what is the method for determining if the diffusion model has converged or not? I noticed that the loss ceased to decrease in the early epochs, but the overall quality of the samples has continued to improve over time.
Hello, have you ever encountered a situation where the loss becomes NaN when training the VAE?
For the diffusion model, the loss tends to have high variance, so it is hard to judge convergence from the loss alone. I usually 1) evaluate a checkpoint every 1000 epochs and decide based on the evaluation metric, and 2) visualize the results. In my experience, LION usually converges at around 10k iterations.
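The checkpoint-based approach described above can be sketched roughly as follows. This is only an illustration, not code from the LION repository: `find_plateau`, `patience`, and `min_delta` are hypothetical names, and it assumes the evaluation metric is one where lower is better. It takes the metric values recorded at each evaluated checkpoint and reports the best checkpoint, plus whether the metric has stopped improving.

```python
from typing import List, Optional, Tuple


def find_plateau(
    evals: List[Tuple[int, float]],
    patience: int = 3,
    min_delta: float = 0.0,
) -> Tuple[Optional[int], float, bool]:
    """Scan (step, metric) pairs from periodic checkpoint evaluations.

    Assumes a lower metric is better. Returns (best_step, best_value,
    plateaued), where plateaued is True once `patience` consecutive
    evaluations fail to beat the best value by more than `min_delta`.
    """
    best_step: Optional[int] = None
    best_val = float("inf")
    stale = 0  # evaluations since the last improvement
    for step, val in evals:
        if val < best_val - min_delta:
            best_step, best_val = step, val
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return best_step, best_val, True
    return best_step, best_val, False


# Made-up metric values, for illustration only.
history = [(1000, 0.9), (2000, 0.7), (3000, 0.6),
           (4000, 0.61), (5000, 0.62), (6000, 0.6)]
best_step, best_val, plateaued = find_plateau(history, patience=3)
```

Because the decision is based on the evaluation metric rather than the training loss, the high variance of the diffusion loss does not affect it. Visual inspection of samples is still worth doing alongside it.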
@fradino for the NaN issue, could you start another issue and post your log & config so that I can help with that?
Hello! I'd like to ask how I can determine if my VAE model has converged. Which metrics or loss should I look at? When I'm training on the car dataset, as the KL weights increase, the latent points become more noisy, leading to a decrease in reconstruction quality. Is it possible that if I keep training the model, the reconstruction quality will continue to get worse? If so, how can I know when to stop training?
I used the default config. trainer.epochs set to 800.

step 25480
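One way to reason about the question above: while the KL weight is still increasing, the training objective itself is changing, so reconstruction error at different steps is not directly comparable. Once the weight reaches its final value, the objective is fixed and the reconstruction error can be checked for stability. The sketch below illustrates that idea under stated assumptions: the linear warm-up in `kl_weight` is a hypothetical schedule and not taken from the LION default config, and `recon_stable_after_warmup`, `window`, and `tol` are illustrative names.

```python
from typing import List, Optional, Tuple


def kl_weight(step: int, warmup_steps: int, max_weight: float) -> float:
    """Linear KL warm-up (hypothetical schedule, not LION's actual one)."""
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)


def recon_stable_after_warmup(
    log: List[Tuple[int, float]],
    warmup_steps: int,
    window: int = 3,
    tol: float = 0.01,
) -> Optional[bool]:
    """Check whether reconstruction error has levelled off after warm-up.

    `log` holds (step, reconstruction_error) pairs. Only entries logged at
    or after `warmup_steps` are used, since the KL weight is constant there.
    Returns None if there are fewer than 2 * window such entries; otherwise
    True when the mean of the last `window` entries is within a relative
    tolerance `tol` of the mean of the `window` entries before them.
    """
    post = [err for step, err in log if step >= warmup_steps]
    if len(post) < 2 * window:
        return None
    prev = sum(post[-2 * window:-window]) / window
    last = sum(post[-window:]) / window
    return abs(last - prev) <= tol * abs(prev)


# Made-up numbers: error rises during warm-up, then flattens out.
recon_log = [(0, 1.0), (50, 1.2), (100, 1.5), (150, 1.52),
             (200, 1.5), (250, 1.51), (300, 1.5), (350, 1.51)]
stable = recon_stable_after_warmup(recon_log, warmup_steps=100)
```

If the check keeps returning `False` long after the weight has stopped changing, that would suggest reconstruction quality is still drifting and the final KL weight may be worth revisiting.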