Checkpoint tweaks #445

jchiang87 · 2024-01-23T22:21:22Z

Subtract off the start_obj_num offset when writing checkpoint files so that they can be used individually or in other combinations with galsim multiprocessing.
Add a checkpoint_sampling_factor config parameter to reduce the number of checkpoints written to be a fraction of the number of stamp batches.
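To illustrate why storing a relative object number helps, here is a minimal sketch. It is not the imSim code: the two helper functions are hypothetical, and it assumes each image's objects occupy a contiguous range of global object numbers beginning at `start_obj_num`.

```python
def to_checkpoint_value(current_obj_num, start_obj_num):
    """Convert a global object number to one relative to this image's start.

    Hypothetical helper: the relative value does not depend on how the
    objects were numbered across images or processes.
    """
    return current_obj_num - start_obj_num


def from_checkpoint_value(saved_value, start_obj_num):
    """Restore a global object number from the relative value in a checkpoint."""
    return saved_value + start_obj_num


# A checkpoint written by a run whose image started at object 1000 ...
saved = to_checkpoint_value(current_obj_num=1250, start_obj_num=1000)

# ... can be picked up by a run in which the same image starts at object 4000.
resumed = from_checkpoint_value(saved, start_obj_num=4000)
```

Because only the relative value is written, the same checkpoint file stays valid when the image is later processed alone or under a different multiprocessing layout.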

write final checkpoint once all the batches are done
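As a rough sketch of how a sampling factor and a final checkpoint could interact (the exact condition used in imSim may differ, and `checkpoint_batches` is a hypothetical name used only for illustration):

```python
def checkpoint_batches(nbatch, sampling_factor):
    """Return the 1-based batch numbers after which a checkpoint is written.

    Assumption: a checkpoint is written after every `sampling_factor`-th
    batch, plus once more after the last batch if it was not already covered.
    """
    if nbatch < 1 or sampling_factor < 1:
        raise ValueError("nbatch and sampling_factor must be positive")
    batches = [b for b in range(1, nbatch + 1) if b % sampling_factor == 0]
    if not batches or batches[-1] != nbatch:
        # Final checkpoint once all the batches are done.
        batches.append(nbatch)
    return batches


every_tenth = checkpoint_batches(100, 10)   # 10, 20, ..., 100
uneven = checkpoint_batches(25, 10)         # 10, 20, then the final batch 25
```

With a factor of 1 this reduces to checkpointing after every batch.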

rmjarvis

Implementation looks fine. But I don't think the default should be 10.

rmjarvis · 2024-02-08T19:26:44Z

config/imsim-config.yaml

    # Even if not checkpointing, using batches helps keep the memory down, since a bunch of
    # temporary data can be purged after each batch.
    # The default number of batches is 100, but you can change it if desired.
    nbatch: 100
+    checkpoint_sampling_factor: 10


Sampling seems like the wrong term for this. Suggest nbatch_per_checkpoint as more directly what this means.

Separate point -- I think our default config should probably have this set to 1. Or at least <10. Running so many concurrent jobs as we did for the Rubin/Roman production runs is not the typical use case, and I think most users will want to checkpoint more often, so they don't lose much work when their node allocation expires.

We can make it less than 10, but 1 seems excessive to me. Would be nicer IMO if we could checkpoint by numbers of objects explicitly.

Actually, I guess now that nbatch has increased to 100, that's 10 checkpoint actions, which is probably reasonable. I think it used to be nbatch=10 IIRC. Maybe we should make the default be to do 10 checkpoints, whatever nbatch is? Or separately set ncheckpoint, which could default to 10? I'm not sure what the most reasonable default is, tbh.

I think if we could specify how many objects we should checkpoint rather than fractions, it would be nicer. But, I guess the way it is organized it really needs to be a multiple of the batching amount.

10 as a default seems reasonable to me. The "right" number is going to depend a lot on how you are using it, if you really are counting on the checkpoints.

I'm mostly trying to figure out what a user would expect if the value is absent from the config file. That's always a hard game to play. :)
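For reference, the "do 10 checkpoints, whatever nbatch is" idea floated above could be sketched like this. `ncheckpoint` is a hypothetical parameter that nothing in this thread confirms was added; the function only shows the arithmetic.

```python
import math


def nbatch_per_checkpoint_for(nbatch, ncheckpoint=10):
    """Pick how many batches to run between checkpoints.

    Hypothetical helper: chooses the smallest spacing that yields at most
    `ncheckpoint` checkpoints over `nbatch` batches.
    """
    if nbatch < 1 or ncheckpoint < 1:
        raise ValueError("nbatch and ncheckpoint must be positive")
    return max(1, math.ceil(nbatch / ncheckpoint))


current_default = nbatch_per_checkpoint_for(100)   # nbatch=100 -> every 10 batches
old_default = nbatch_per_checkpoint_for(10)        # nbatch=10  -> every batch
```

This keeps the number of checkpoint writes roughly constant as `nbatch` changes, at the cost of one more quantity for users to reason about.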

rmjarvis · 2024-02-08T19:30:30Z

imsim/lsst_image.py

@@ -86,6 +87,7 @@ def setup(self, config, base, image_num, obj_num, ignore, logger):
        try:
            self.checkpoint = galsim.config.GetInputObj('checkpoint', config, base, 'LSST_Image')
            self.nbatch = params.get('nbatch', 100)
+            self.checkpoint_sampling_factor = params.get('checkpoint_sampling_factor', 10)


Even more so here: I think the default should be 1 if this option is not given in the config at all.
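A small sketch of what that suggestion amounts to, using the name proposed earlier in this review (`nbatch_per_checkpoint`) and a fallback of 1. The merged code is not shown in this thread, so both the name and the default here are assumptions, and `read_checkpoint_params` is a hypothetical stand-in for the lookup in `setup`.

```python
def read_checkpoint_params(params):
    """Read batching options from a parsed config dict, applying defaults.

    Assumption: an absent `nbatch_per_checkpoint` means "checkpoint after
    every batch", which is the least surprising behavior for a user who
    never set the option.
    """
    nbatch = params.get('nbatch', 100)
    nbatch_per_checkpoint = params.get('nbatch_per_checkpoint', 1)
    return nbatch, nbatch_per_checkpoint


absent = read_checkpoint_params({})
explicit = read_checkpoint_params({'nbatch': 100, 'nbatch_per_checkpoint': 10})
```

Large production runs can still opt into sparser checkpoints by setting the value explicitly in their config.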

cwwalter

LGTM

jchiang87 added 3 commits January 23, 2024 08:32

subtract base[start_obj_num] offset from checkpointed obj_num value

4203376

reduce checkpoint frequency by configurable checkpoint_sampling_factor

a981ec2

write final checkpoint once all the batches are done

test for checkpoint_sampling_factor

0a6e956

jchiang87 requested review from rmjarvis and cwwalter and removed request for rmjarvis February 8, 2024 19:22

rmjarvis requested changes Feb 8, 2024

change config parameter name and default value using Mike's review suggestions

eb0f63b

cwwalter reviewed Feb 9, 2024

rmjarvis approved these changes Feb 9, 2024

jchiang87 merged commit 50249cd into main Feb 9, 2024
3 checks passed

jchiang87 deleted the u/jchiang/checkpoint_fixes branch February 9, 2024 18:07
