[Feature] Support preserving only the Head pod with RayJob when shutdownAfterJobFinishes=true #2615

Open
andrewsykim opened this issue Dec 5, 2024 · 8 comments
Labels: 1.4.0, enhancement, rayjob

Comments

@andrewsykim
Collaborator

Search before asking

  • I had searched in the issues and found no similar feature requirement.

Description

When using RayJob with shutdownAfterJobFinishes=true, it can be useful to keep the Head pod around to debug issues or access the Ray dashboard while deleting the workers, which use more expensive resources.

Use case

Allow access to the Ray dashboard after the job finishes.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@andremoeller

Similarly or alternatively: dumping state to persistent storage to allow resuming Dashboard after a Job has finished.

@MortalHappiness
Member

Should we add a flag to specify whether to preserve the head pod?

@MortalHappiness
Member

In this case, the RayCluster must be retained and cannot be deleted. There are two possible approaches:

  1. Add a field to the RayCluster CR, and when this field is set to true, delete all workers.
  2. Force all worker replicas (or max replicas) to be set to 0.
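As a rough illustration of the second approach, the sketch below zeroes the replica counts of every worker group while leaving the head untouched. The `WorkerGroup` and `ClusterSpec` types and the `scaleWorkersToZero` helper are hypothetical, simplified stand-ins written for this example; they are not the KubeRay API types, and a real controller would also need to account for the autoscaler and in-flight pod deletions.

```go
package main

import "fmt"

// WorkerGroup is a hypothetical, trimmed-down stand-in for a RayCluster
// worker group spec. It is NOT the real KubeRay type.
type WorkerGroup struct {
	GroupName   string
	Replicas    *int32
	MinReplicas *int32
	MaxReplicas *int32
}

// ClusterSpec is a hypothetical, trimmed-down cluster spec holding only
// the worker groups; the head group is deliberately left out.
type ClusterSpec struct {
	WorkerGroups []WorkerGroup
}

// scaleWorkersToZero sets replicas, minReplicas and maxReplicas to 0 on
// every worker group. It reports whether anything changed so that a
// caller can skip a no-op update.
func scaleWorkersToZero(spec *ClusterSpec) bool {
	changed := false
	for i := range spec.WorkerGroups {
		g := &spec.WorkerGroups[i]
		for _, field := range []**int32{&g.Replicas, &g.MinReplicas, &g.MaxReplicas} {
			// An unset (nil) field is made explicit, which counts as a change.
			if *field == nil || **field != 0 {
				zero := int32(0)
				*field = &zero
				changed = true
			}
		}
	}
	return changed
}

func main() {
	replicas, minR, maxR := int32(3), int32(1), int32(5)
	spec := ClusterSpec{WorkerGroups: []WorkerGroup{
		{GroupName: "gpu-workers", Replicas: &replicas, MinReplicas: &minR, MaxReplicas: &maxR},
	}}

	fmt.Println("changed:", scaleWorkersToZero(&spec))
	fmt.Println("replicas:", *spec.WorkerGroups[0].Replicas)
}
```

Setting the maximum to 0 as well as the current count is meant to keep anything that scales within the min/max bounds from bringing workers back; whether that is sufficient in practice is an assumption of this sketch.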

@andrewsykim
Collaborator Author

andrewsykim commented Dec 8, 2024

I'm thinking of a new field like deletePolicy or cleanupPolicy, with options like DeleteCluster, DeleteWorkers, and None. But this would overlap with existing fields like shutdownAfterJobFinishes.
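To make the overlap concrete, here is a minimal sketch of how such a field could coexist with shutdownAfterJobFinishes. Everything in it is hypothetical: the `DeletePolicy` type, the `JobSpec` struct, and the `effectivePolicy` helper are names invented for this example, and the precedence rule (an explicit policy wins, otherwise the boolean decides) is only one possible way to resolve the overlap, not a decided design.

```go
package main

import "fmt"

// DeletePolicy is a hypothetical enum using the option names floated in
// this thread. It is not an existing KubeRay type.
type DeletePolicy string

const (
	DeleteCluster DeletePolicy = "DeleteCluster"
	DeleteWorkers DeletePolicy = "DeleteWorkers"
	DeleteNone    DeletePolicy = "None"
)

// JobSpec is a hypothetical, trimmed-down RayJob spec with only the two
// fields whose interaction is being illustrated.
type JobSpec struct {
	ShutdownAfterJobFinishes bool
	DeletePolicy             *DeletePolicy // nil means "not set"
}

// effectivePolicy resolves the overlap between the two fields.
// Assumption: an explicitly set policy takes precedence; otherwise the
// existing boolean maps to DeleteCluster (true) or None (false).
func effectivePolicy(spec JobSpec) DeletePolicy {
	if spec.DeletePolicy != nil {
		return *spec.DeletePolicy
	}
	if spec.ShutdownAfterJobFinishes {
		return DeleteCluster
	}
	return DeleteNone
}

func main() {
	keepHead := DeleteWorkers

	// Explicit policy: keep the head pod, delete only the workers.
	fmt.Println(effectivePolicy(JobSpec{ShutdownAfterJobFinishes: true, DeletePolicy: &keepHead}))
	// No policy set: fall back to the existing boolean.
	fmt.Println(effectivePolicy(JobSpec{ShutdownAfterJobFinishes: true}))
	fmt.Println(effectivePolicy(JobSpec{}))
}
```

A pointer is used for the policy field so that "unset" can be told apart from an explicit `None`, which is what lets the boolean keep its current meaning for existing users.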

@kevin85421
Member

This makes sense to me.

@andrewsykim
Collaborator Author

Initial PR is here #2643

@kevin85421
Member

For v1.3.0, we decided not to make suspend user-facing for now, and #2765 will be included in v1.4.0. The scope of this API for v1.3.0 has already been completed, so I'm changing the tag from v1.3.0 to v1.4.0.

kevin85421 added the 1.4.0 label and removed the 1.3.0 label on Jan 21, 2025