Implement abort running backup job #2098

Open
dns2utf8 opened this issue Dec 3, 2019 · 10 comments

Comments

@dns2utf8
Contributor

dns2utf8 commented Dec 3, 2019

Describe the problem/challenge you have
If one submits a backup job and forgets the narrowing selector, the system is blocked.

Describe the solution you'd like
I would like to be able to abort a job so the queue continues to work.

Anything else you would like to add:
Short jobs that run every minute pile up very quickly.

Environment:

  • velero version
    Client:
    Version: v1.2.0
    Git commit: 5d00849
    Server:
    Version: v1.2.0

  • kubectl version
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.6", GitCommit:"7015f71e75f670eb9e7ebd4b5749639d42e20079", GitTreeState:"clean", BuildDate:"2019-11-13T11:20:18Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes installer & version: v1.15.3

  • Cloud provider or hardware configuration: Intel(R) Xeon(R) CPU E5-2650

  • OS (e.g. from /etc/os-release): Ubuntu 18.04.3 LTS (Bionic Beaver)

@skriss skriss added the Enhancement/User End-User Enhancement to Velero label Dec 6, 2019
@skriss
Contributor

skriss commented Dec 6, 2019

Adding to this, it would be nice to be able to:

  • abort a running job
  • detect and either fail or restart jobs that are hung

We might be able to get some of this behavior for free if we move to running backups/restores in their own Pods/Jobs, per #1653

@d3473r

d3473r commented Feb 12, 2021

I started a backup job on a tainted node on which no restic workload was running until I added the appropriate toleration key.
As I couldn't abort the running backup, I had to wait ~4 hours until the job ran into the timeout :/
The abort feature would be really helpful in a mishap like this.

@dsu-igeek
Contributor

Should also handle aborting restores in progress

@KamranAzeem

I was stuck in a similar situation. I started a backup job only to realize later that I had forgotten something important.

Instead of waiting for 2+ hours, I decided to delete Velero and reinstall it. We had a tight maintenance window, so I took this path. A backup abort would be very helpful in such situations.

@ywk253100
Contributor

It would be easier to cancel an in-progress backup with #4772.

@reasonerjt reasonerjt added kind/requirement Reviewed Q2 2021 Enhancement/User End-User Enhancement to Velero and removed Enhancement/User End-User Enhancement to Velero Reviewed Q2 2021 labels May 20, 2022
@eleanor-millman eleanor-millman added the 1.10-candidate The label used for 1.10 planning discussion. label May 25, 2022
@reasonerjt
Contributor

I think this is a valid requirement; however, currently the whole backup runs in one reconcile action.
We can only abort a running backup after we rework the mechanism to trigger it. We may combine it with the effort to support running backups concurrently.
I'll leave it in the backlog.

@reasonerjt reasonerjt added backlog Needs investigation and removed 1.10-candidate The label used for 1.10 planning discussion. labels Jun 17, 2022
@Lyndon-Li Lyndon-Li self-assigned this Dec 6, 2022
@aterrell5

Similar issue and requirement from production. Accidentally left a namespace and objects in a backup job that didn't need them. There is no way to stop or abort the backup from going to the offsite storage bucket, consuming space, bandwidth, money, and time. We desperately need a way to cleanly abort a task. The issue will only get more problematic with time as the dataset and workloads grow.

@benedikt-bartscher

Same here, I typed a wrong label selector and created a backup of my whole cluster...

@kaovilai
Member

kaovilai commented Feb 4, 2025

cc: @sseago @draghuram
re: backup abort

We can ignore retry on NotExists so it only retries on other errors.
Then we can watch the backup for delete events and cancel the backup's context.
A finalizer should be added to the backup while it is running and removed when cancel cleanup is done.

@sseago
Collaborator

sseago commented Feb 4, 2025

@kaovilai Watching delete might be helpful for cleaning up after questionable user actions, but I don't think kubectl backup delete should be the documented or expected API for backup cancellation. We want an explicit cancel -- either by introducing a backup.spec.cancel field or by adding a BackupCancellationRequest CR.
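
For comparison, a rough Go sketch of the spec-field variant of an explicit cancel. This is purely illustrative: the `Cancel` field, the `Cancelled` phase, and `applyCancel` are hypothetical and only model the proposal in this comment; the types are trimmed stand-ins, not Velero's API. The `BackupCancellationRequest` CR alternative is not shown.

```go
package main

import "fmt"

// BackupSpec is a trimmed, hypothetical stand-in for a backup spec.
// The Cancel field models the proposed backup.spec.cancel.
type BackupSpec struct {
	Cancel bool
}

// Backup is a trimmed, hypothetical stand-in for the backup resource.
type Backup struct {
	Name  string
	Spec  BackupSpec
	Phase string
}

const (
	PhaseInProgress = "InProgress"
	PhaseCompleted  = "Completed"
	PhaseCancelled  = "Cancelled" // hypothetical phase for this sketch
)

// applyCancel honours an explicit cancel request. Only a backup that is
// still in progress is affected; the object itself is kept, unlike a delete.
// It reports whether the phase changed.
func applyCancel(b *Backup) bool {
	if b.Spec.Cancel && b.Phase == PhaseInProgress {
		b.Phase = PhaseCancelled
		return true
	}
	return false
}

func main() {
	running := &Backup{Name: "whole-cluster", Phase: PhaseInProgress}
	running.Spec.Cancel = true // the user asks for cancellation explicitly
	fmt.Println(applyCancel(running), running.Phase)

	done := &Backup{Name: "finished", Phase: PhaseCompleted}
	done.Spec.Cancel = true // too late: the backup already completed
	fmt.Println(applyCancel(done), done.Phase)
}
```

The design point the sketch tries to capture is that cancellation is a recorded intent on an object that continues to exist, so its final state and partial results can still be inspected afterwards.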
