Training strategy for Zipformer using fp16? #1461
Unanswered
ZQuang2202 asked this question in Q&A
Replies: 1 comment 1 reply
-
Are you using a single GPU with max-duration=300? The gradient noise might be large with such a small batch size. You could try a smaller base-lr, e.g. 0.025, and keep lr_batches/lr_epochs unchanged. Usually you don't need to tune the Balancer and Whitener configurations.
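For reference, 0.025 is roughly what the √k rule discussed below gives when starting from the recipe's recommended setup. A minimal sketch of that arithmetic (the 0.045 default base-lr and the max-duration values are assumptions taken from the recommended configuration; check your own train.py):

```python
import math

# sqrt(k) learning-rate scaling: when the batch size (max-duration) shrinks by
# a factor of k, scale the base learning rate down by sqrt(k).
default_base_lr = 0.045       # assumed default base-lr of the zipformer recipe
default_max_duration = 1000   # recommended max-duration
reduced_max_duration = 300    # what fits on a single GPU here

k = default_max_duration / reduced_max_duration   # ~3.33
scaled_base_lr = default_base_lr / math.sqrt(k)   # ~0.0246

print(f"k = {k:.2f}, scaled base-lr = {scaled_base_lr:.4f}")  # -> ~0.025
```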
-
Hi everyone,
I am a student trying to reproduce the Zipformer results on LibriSpeech 100h, but hardware limitations prevent me from using the recommended configuration. Because of these constraints I have reduced the batch size (max_duration) to 300, as opposed to the recommended 1000. However, I am struggling to find an appropriate configuration for the Eden scheduler.
Following the strategy of decreasing the learning rate by √k when the batch size decreases by a factor of k, I initially set base_lr to 0.03 and kept the other configurations at their default values, but training diverged. Despite attempts to adjust lr_batches, lr_epochs (3.5-6), and base_lr (0.03-0.045), it still diverges. Notably, the divergence occurs when batch_count reaches around 700-900, leading to 'parameter domination' issues in the embed_conv and some attention modules. I attach some log information below.
[Screenshots of training logs showing the loss divergence and the parameter-domination warnings for embed_conv and some attention modules]
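For context on how base_lr, lr_batches, and lr_epochs interact, here is a rough sketch of the Eden schedule as described in the Zipformer paper and implemented in icefall's optim.py (the defaults shown and the omission of the linear-warmup factor are assumptions, so verify against the code). It shows that base_lr is a pure multiplicative factor: lowering it scales the whole learning-rate curve down without changing its shape, while lr_batches and lr_epochs control how quickly the decay kicks in.

```python
def eden_lr(base_lr: float, batch: int, epoch: float,
            lr_batches: float = 7500.0, lr_epochs: float = 3.5) -> float:
    """Sketch of the Eden schedule (warmup factor omitted; defaults assumed)."""
    batch_factor = ((batch ** 2 + lr_batches ** 2) / lr_batches ** 2) ** -0.25
    epoch_factor = ((epoch ** 2 + lr_epochs ** 2) / lr_epochs ** 2) ** -0.25
    # base_lr multiplies everything, so a smaller base_lr lowers the LR at
    # every batch/epoch without changing the shape of the decay.
    return base_lr * batch_factor * epoch_factor

# LR around the point where training diverges (batch_count ~ 800, early in epoch 1)
for lr in (0.045, 0.03, 0.025):
    print(f"base_lr={lr}: lr at batch 800 = {eden_lr(lr, batch=800, epoch=1):.5f}")
```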
To address this, I tried reducing the gradient scale of the layers experiencing 'parameter domination', but it proved ineffective.
I have a few questions:
Thank you.