-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Checkpoint tweaks #445
Checkpoint tweaks #445
Conversation
write final checkpoint once all the batches are done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation looks fine. But I don't think the default should be 10.
config/imsim-config.yaml
Outdated
# Even if not checkpointing, using batches helps keep the memory down, since a bunch of | ||
# temporary data can be purged after each batch. | ||
# The default number of batches is 100, but you can change it if desired. | ||
nbatch: 100 | ||
checkpoint_sampling_factor: 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sampling seems like the wrong term for this. Suggest nbatch_per_checkpoint
as more directly what this means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Separate point -- I think our default config should probably have this set to 1. Or at least <10. Running so many concurrent jobs as we did for the Rubin/Roman production runs is not the typical use case, and I think most users will want to checkpoint more often, so they don't lose much work when their node allocation expires.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can make it less than 10, but 1 seems excessive to me. Would be nicer IMO if we could checkpoint by numbers of objects explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I guess now that nbatch has increased to 100, that's 10 checkpoint actions, which is probably reasonable. I think it used to be nbatch=10 IIRC. Maybe we should make the default be to do 10 checkpoints, whatever nbatch is? Or separately set ncheckpoint, which could default to 10? I'm not sure what the most reasonable default is, tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we could specify how many objects we should checkpoint rather than fractions, it would be nicer. But, I guess the way it is organized it really needs to be a multiple of the batching amount.
10 as a default seems reasonable to me. The "right" number is going to depend on lot on how you are using it if you really are counting on the checkpoints.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mostly trying to figure out what a user would expect if the value is absent from the config file. That's always a hard game to play. :)
imsim/lsst_image.py
Outdated
@@ -86,6 +87,7 @@ def setup(self, config, base, image_num, obj_num, ignore, logger): | |||
try: | |||
self.checkpoint = galsim.config.GetInputObj('checkpoint', config, base, 'LSST_Image') | |||
self.nbatch = params.get('nbatch', 100) | |||
self.checkpoint_sampling_factor = params.get('checkpoint_sampling_factor', 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even moreso here, default should be 1 I think if this option is not given in the config at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
start_obj_num
offset when writing checkpoint files so that they can be used individually or in other combinations with galsim multiprocessing.checkpoint_sampling_factor
config parameter to reduce the number of checkpoints written to be a fraction of the number of stamp batches.