Don't destroy the job unless the executor crashes #56

calebwin · 2021-10-15T00:30:14Z

We should really only end a running job if the program crashes on the executor or the user explicitly calls destroy_job.

When scheduling fails

On a call to a writing function or to collect, the recorded lazy computation is scheduled and executed. If scheduling fails, we currently destroy the job. If you're using Banyan Julia from a notebook, this is undesirable: you have to restart the job (which can take 1-2 minutes) just because a single cell failed. Instead, a call to a writing function or to collect should not leave global state modified when it fails; any changes it made should be rolled back.
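A minimal sketch of the snapshot-and-rollback idea, written in Python for illustration (Banyan.jl itself is Julia). It assumes the client-side global state touched by collect or a writing function can be deep-copied. `SessionState`, `run_with_rollback`, and `SchedulingError` are hypothetical names, not part of Banyan's API.

```python
import copy
from dataclasses import dataclass, field


class SchedulingError(Exception):
    """Hypothetical error raised when recorded lazy computation can't be scheduled."""


@dataclass
class SessionState:
    # Hypothetical stand-in for the client-side global state that
    # collect/write calls mutate (recorded tasks, value locations, ...).
    pending_tasks: list = field(default_factory=list)
    locations: dict = field(default_factory=dict)
    job_alive: bool = True


def run_with_rollback(state, action):
    """Run `action(state)`; if it raises, restore `state` to its prior contents.

    The job is never destroyed here, so a failed notebook cell does not
    force a job restart.
    """
    snapshot = copy.deepcopy(vars(state))
    try:
        return action(state)
    except Exception:
        # Restore every field from the snapshot taken before the call,
        # then re-raise so the caller still sees the error.
        vars(state).clear()
        vars(state).update(snapshot)
        raise


# Usage: a collect that mutates state and then fails partway through scheduling.
def failing_collect(state):
    state.pending_tasks.append("sum")
    state.locations["df"] = "remote"
    raise SchedulingError("no valid partitioning found")


state = SessionState(pending_tasks=["read_csv"])
try:
    run_with_rollback(state, failing_collect)
except SchedulingError:
    pass
print(state.pending_tasks)  # ['read_csv']
print(state.job_alive)      # True
```

Taking the snapshot before the call (rather than trying to undo individual mutations afterwards) keeps the rollback correct no matter where in scheduling the failure happens, at the cost of copying the state on every call.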

When an exception occurs on the cluster

If the job crashes in the backend, we effectively have to destroy it. But if an exception is merely raised, we should ideally propagate it back to the client and roll back in the same way as for a scheduling failure.
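The distinction between a crash and an ordinary exception can be sketched as a client-side handler that inspects the outcome reported by the cluster. This is a Python illustration under assumptions: the cluster is assumed to report a dict with a `"status"` of `"ok"`, `"exception"`, or `"crashed"`, and `Job`, `compute`, `ExecutorCrashed`, `RemoteError`, and this `destroy_job` stand-in are all hypothetical, not Banyan.jl's actual API.

```python
import copy
from dataclasses import dataclass, field


class ExecutorCrashed(Exception):
    """The executor process died; the job cannot be reused."""


class RemoteError(Exception):
    """User code raised an exception on the cluster; the job is still healthy."""


@dataclass
class Job:
    # Hypothetical client-side handle to a running job.
    job_id: str
    alive: bool = True
    state: dict = field(default_factory=dict)  # client-side global state


def destroy_job(job):
    # Hypothetical stand-in for tearing down the cluster-side job.
    job.alive = False


def compute(job, run_on_cluster):
    """Run `run_on_cluster(state)` and react to the outcome it reports.

    `run_on_cluster` is assumed to return a dict whose "status" key is
    "ok", "exception", or "crashed".
    """
    snapshot = copy.deepcopy(job.state)
    result = run_on_cluster(job.state)
    status = result["status"]
    if status == "ok":
        return result["value"]
    if status == "crashed":
        # Only a genuine executor crash ends the job.
        destroy_job(job)
        raise ExecutorCrashed(result.get("message", "executor crashed"))
    # An ordinary remote exception: roll back client state, keep the job
    # alive, and surface the error to the user.
    job.state = snapshot
    raise RemoteError(result.get("message", "remote exception"))
```

With this split, a `KeyError` raised inside user code on the cluster shows up as a `RemoteError` in the notebook cell while the job stays alive, whereas only an executor crash triggers `destroy_job`.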


calebwin added the enhancement (New feature or request) and banyan-jl (Concerning Banyan.jl) labels Oct 15, 2021

calebwin self-assigned this Oct 15, 2021
