Just to see the diff #3
Comment about the batch size: we're assuming we can fit a batch size of 320 across our workers, but I think we can only fit 12 sequences on an A100 40GB (so on 16 workers that's a batch of 16 * 12 = 192).
So we should probably either incorporate gradient accumulation and store the losses over 2 iterations (2 * 10 (small batch) * 16 GPUs = 320), or change the batch sizes from 320/32 to something that fits while keeping the 10% ratio, e.g. 160/16. The paper only talks about the 10% ratio, but I'm not sure whether using large batches is also important. A minimal sketch of what I mean by the accumulation option is below.
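Something like this is what I had in mind for the accumulation route (the model, optimizer, and data below are just placeholders, not from this repo; the sizes are illustrative): two accumulation steps with a per-GPU micro-batch of 10 give 2 * 10 * 16 GPUs = 320.

```python
import torch
import torch.nn as nn

# Illustrative sketch of gradient accumulation (placeholder model/data).
# Effective batch = ACCUM_STEPS * MICRO_BATCH * 16 GPUs = 2 * 10 * 16 = 320.
ACCUM_STEPS = 2
MICRO_BATCH = 10

model = nn.Linear(128, 1)                      # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

optimizer.zero_grad()
for step in range(4):                          # stand-in for the data loader
    x = torch.randn(MICRO_BATCH, 128)
    y = torch.randn(MICRO_BATCH, 1)
    loss = loss_fn(model(x), y)
    (loss / ACCUM_STEPS).backward()            # scale so grads average over the 2 steps
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()                       # one update per effective batch of 320
        optimizer.zero_grad()
```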
There are no gradients here, which means that (a) we can likely fit a bigger batch size than 12, and (b) instead of grad accumulation we can just run multiple passes one right after another and store the losses if it doesn't fit. Roughly like the sketch below.
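A sketch of that no-grad variant (again with a placeholder model and data, sizes illustrative): since no graph is built, larger micro-batches fit, and if the full per-GPU slice still doesn't, a few back-to-back passes can be concatenated.

```python
import torch
import torch.nn as nn

# Illustrative no-grad sketch (placeholder model/data): run NUM_PASSES
# micro-batches back to back and keep the per-example losses.
# 2 passes * 10 per GPU * 16 GPUs = 320 examples total.
MICRO_BATCH = 10
NUM_PASSES = 2

model = nn.Linear(128, 1)                      # stand-in for the real model
loss_fn = nn.MSELoss(reduction="none")         # keep per-example losses

losses = []
with torch.no_grad():                          # no graph kept -> much less memory
    for _ in range(NUM_PASSES):
        x = torch.randn(MICRO_BATCH, 128)
        y = torch.randn(MICRO_BATCH, 1)
        losses.append(loss_fn(model(x), y).squeeze(1))
all_losses = torch.cat(losses)                 # this GPU's slice of the 320-batch
```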
Yes, right! By "grad acc." I also meant iterating like that and accumulating the losses.