DB is persistent but ray jobs aren't #690

Open
1 task done
agpituk opened this issue Jan 20, 2025 · 2 comments · May be fixed by #744
Assignees
Labels
enhancement New feature or request

Comments

@agpituk
Contributor

agpituk commented Jan 20, 2025

Motivation

After we added MinIO, our DB became persistent, but our jobs in Ray are not. When we restart the containers, we can see what is in the database but not what is in Ray, which causes some issues. We need to discuss how to approach this: should we add a volume for Ray to make it persistent, or should we make the API fault tolerant anyway?

Alternatives

No response

Contribution

No response

Have you searched for similar issues before submitting this one?

  • Yes, I have searched for similar issues
@agpituk agpituk added the enhancement New feature or request label Jan 20, 2025
@aittalam
Member

aittalam commented Jan 20, 2025

Just a tiny note: after the MinIO merge, files on S3 are persistent; the DB, however, is not yet, due to the considerations noted in this draft PR: #674

@javiermtorres javiermtorres self-assigned this Jan 24, 2025
@javiermtorres
Contributor

The only option for metadata persistence in Ray is to enable an external Redis server (https://www.anyscale.com/blog/ray-version-1-11-released#ray-no-longer-launches-redis-by-default, https://docs.ray.io/en/latest/ray-core/fault_tolerance/gcs.html), so the PR will take this route.
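As a rough illustration of the route described above, here is a minimal sketch of how a Ray head node could be pointed at an external Redis. It relies on the `RAY_REDIS_ADDRESS` environment variable described in the Ray GCS fault tolerance docs linked above. The helper name `ray_head_launch` and the `redis` host name are hypothetical and not taken from the PR.

```python
import os


def ray_head_launch(redis_host: str, redis_port: int = 6379) -> tuple[dict[str, str], list[str]]:
    """Return (env, argv) for starting a Ray head node backed by an external Redis.

    Hypothetical helper for illustration; the actual PR may wire this up
    differently (e.g. directly in the container configuration).
    """
    env = dict(os.environ)
    # Ray reads the external Redis address for GCS storage from this variable.
    env["RAY_REDIS_ADDRESS"] = f"{redis_host}:{redis_port}"
    argv = ["ray", "start", "--head"]
    return env, argv


# "redis" is an assumed service name for the Redis container.
env, argv = ray_head_launch("redis")
print(env["RAY_REDIS_ADDRESS"], "|", " ".join(argv))
```

The returned pair could then be passed to something like `subprocess.run(argv, env=env, check=True)` in an entrypoint script. If Redis is password protected, the Ray docs describe additional flags for `ray start`.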

Ray seems to be considering persistence at least for job state and related information (e.g. ray-project/ray#39503), but that issue is apparently marked as "important, but not time critical". Embedded KV stores like RocksDB or SplinterDB could play a role in this space similar to that of SQLite.

A lightly loaded Redis should need only a few MB more than 3 MB of memory, according to the Redis documentation: https://redis.io/docs/latest/develop/get-started/faq/#whats-the-redis-memory-footprint. The container will be configured to keep this memory use under control. It may also be necessary to add documentation instructing the user to clean up the memory volumes if a sizable amount of job state is stored. The lifetimes of all named volumes will be tied together so that the DB and Ray persistent data are always consistent.
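One way the memory cap and on-disk persistence could be expressed is through standard `redis-server` options. The sketch below only builds the command line. The 64 MB limit, the `/data` directory (assumed to be where the named volume is mounted), and the helper name `redis_server_args` are assumptions for illustration, not values from the PR.

```python
def redis_server_args(max_memory_mb: int = 64, data_dir: str = "/data") -> list[str]:
    """Build a redis-server command line with a memory cap and AOF persistence.

    Hypothetical helper; the limit and directory are placeholder values.
    """
    if max_memory_mb <= 0:
        raise ValueError("max_memory_mb must be positive")
    return [
        "redis-server",
        # Upper bound on memory used for data.
        "--maxmemory", f"{max_memory_mb}mb",
        # Assumption: reject writes at the limit instead of silently
        # evicting keys that hold Ray state.
        "--maxmemory-policy", "noeviction",
        # Append-only file so the data survives container restarts.
        "--appendonly", "yes",
        # Directory where persistence files are written.
        "--dir", data_dir,
    ]


print(" ".join(redis_server_args()))
```

With `noeviction`, Redis returns errors on writes once the cap is reached, which makes a full store visible rather than silently dropping job state. Whether that trade-off suits this project is an open question.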

@javiermtorres javiermtorres linked a pull request Jan 27, 2025 that will close this issue
4 tasks
3 participants