Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Checkpoint tweaks #445

Merged
merged 4 commits into from
Feb 9, 2024
Merged

Checkpoint tweaks #445

merged 4 commits into from
Feb 9, 2024

Conversation

jchiang87
Copy link
Collaborator

  • Subtract off the start_obj_num offset when writing checkpoint files so that they can be used individually or in other combinations with galsim multiprocessing.
  • Add a checkpoint_sampling_factor config parameter to reduce the number of checkpoints written to be a fraction of the number of stamp batches.

@jchiang87 jchiang87 requested review from rmjarvis and cwwalter and removed request for rmjarvis February 8, 2024 19:22
Copy link
Contributor

@rmjarvis rmjarvis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks fine. But I don't think the default should be 10.

# Even if not checkpointing, using batches helps keep the memory down, since a bunch of
# temporary data can be purged after each batch.
# The default number of batches is 100, but you can change it if desired.
nbatch: 100
checkpoint_sampling_factor: 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sampling seems like the wrong term for this. Suggest nbatch_per_checkpoint as more directly what this means.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate point -- I think our default config should probably have this set to 1. Or at least <10. Running so many concurrent jobs as we did for the Rubin/Roman production runs is not the typical use case, and I think most users will want to checkpoint more often, so they don't lose much work when their node allocation expires.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can make it less than 10, but 1 seems excessive to me. Would be nicer IMO if we could checkpoint by numbers of objects explicitly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I guess now that nbatch has increased to 100, that's 10 checkpoint actions, which is probably reasonable. I think it used to be nbatch=10 IIRC. Maybe we should make the default be to do 10 checkpoints, whatever nbatch is? Or separately set ncheckpoint, which could default to 10? I'm not sure what the most reasonable default is, tbh.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we could specify how many objects we should checkpoint rather than fractions, it would be nicer. But, I guess the way it is organized it really needs to be a multiple of the batching amount.

10 as a default seems reasonable to me. The "right" number is going to depend on lot on how you are using it if you really are counting on the checkpoints.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly trying to figure out what a user would expect if the value is absent from the config file. That's always a hard game to play. :)

@@ -86,6 +87,7 @@ def setup(self, config, base, image_num, obj_num, ignore, logger):
try:
self.checkpoint = galsim.config.GetInputObj('checkpoint', config, base, 'LSST_Image')
self.nbatch = params.get('nbatch', 100)
self.checkpoint_sampling_factor = params.get('checkpoint_sampling_factor', 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even moreso here, default should be 1 I think if this option is not given in the config at all.

Copy link
Member

@cwwalter cwwalter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jchiang87 jchiang87 merged commit 50249cd into main Feb 9, 2024
3 checks passed
@jchiang87 jchiang87 deleted the u/jchiang/checkpoint_fixes branch February 9, 2024 18:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants