Implement checkpoint resumption #362

Merged
merged 2 commits into from
Feb 21, 2024

Conversation

MattWellie
Collaborator

Fixes

  • Some AIP runs are currently getting stuck, running for hours and then failing with timeouts. This failure mode might be caused by general busyness of Hail/GCP, and doesn't appear to be due to bad data (though these runtimes are unprecedented).

Proposed Changes

  • Implements checkpoint resumption: if the target checkpoint already exists, the run resumes from it
  • Should be followed by a pipeline job that deletes the checkpoints if this stage succeeds
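
As a rough illustration of the two bullets above (not the code merged in this PR), the resume-then-clean-up pattern might look like the sketch below. The names `resume_or_compute` and `delete_checkpoints` are hypothetical, and plain text files stand in for Hail checkpoints, so the read and write steps are assumptions.

```python
import shutil
from pathlib import Path
from typing import Callable


def resume_or_compute(checkpoint: Path, compute: Callable[[], str]) -> str:
    """Return the checkpointed result if present, otherwise compute and save it.

    Hypothetical helper: a text file stands in for a real checkpoint.
    """
    if checkpoint.exists():
        # Target checkpoint already exists: resume from it and skip the slow step.
        return checkpoint.read_text()
    result = compute()
    checkpoint.parent.mkdir(parents=True, exist_ok=True)
    checkpoint.write_text(result)
    return result


def delete_checkpoints(checkpoint_dir: Path) -> None:
    """Hypothetical follow-up job: remove checkpoints once the stage has succeeded."""
    shutil.rmtree(checkpoint_dir, ignore_errors=True)
```

Running the clean-up only after the stage succeeds means a timed-out run leaves its checkpoints behind for the next attempt, while a successful run does not leave stale data for a later run to pick up.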

Considerations

  • This has caused some issues in production, with runs resuming from bad data. I'll need to keep this in mind, but so far runs have failed for scheduling reasons, not data reasons
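
One possible guard against resuming from a bad or partial checkpoint, sketched under assumptions rather than taken from this PR: write a marker file as the last step of a checkpoint, and only resume when the marker is present. The marker name `_SUCCESS`, the `data.txt` payload file, and both function names are hypothetical.

```python
from pathlib import Path

MARKER = "_SUCCESS"  # assumed marker name; written only after the data is complete


def write_checkpoint(checkpoint_dir: Path, payload: str) -> None:
    """Write the data first, then the marker, so the marker implies a finished write."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    (checkpoint_dir / "data.txt").write_text(payload)
    (checkpoint_dir / MARKER).touch()


def is_complete(checkpoint_dir: Path) -> bool:
    """Treat a checkpoint as resumable only if its completion marker exists."""
    return checkpoint_dir.is_dir() and (checkpoint_dir / MARKER).is_file()
```

A marker like this only detects interrupted writes. It would not catch a checkpoint that was written completely from bad input data.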

@MattWellie MattWellie requested a review from cassimons February 21, 2024 22:57
@MattWellie MattWellie changed the title implement checkpoint resumption Implement checkpoint resumption Feb 21, 2024
@MattWellie MattWellie requested a review from EddieLF February 21, 2024 23:03
@MattWellie MattWellie merged commit eae8bdb into main Feb 21, 2024
4 of 5 checks passed
@MattWellie MattWellie deleted the checkpoint_resumption branch February 21, 2024 23:10
2 participants