
Worker RestartPolicy not settable #856

Closed
briceruzand opened this issue Jan 18, 2024 · 2 comments
Comments

@briceruzand
In order to use autoscaling, and more specifically the scale-down feature, I understand that I need to set the pods' restartPolicy to Never or OnFailure so that my worker pods can shut down normally instead of being restarted automatically.

But when I redeploy my working DaskCluster with restartPolicy: Never added, the workers fail to deploy: the operator fails to update the replicas (see logs below).

---
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: dask-cluster
spec:
  worker:
    replicas: 2
    spec:
      restartPolicy: Never
      ...

---
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: dask-cluster-autoscaler
spec:
  cluster: dask-cluster
  minimum: 2  # minimum number of workers to create
  maximum: 4 # maximum number of workers to create

Environment:

  • dask-kubernetes version: 2024.1.0
  • Kubernetes version: 1.28

Dask Operator logs:

[2024-01-18 14:45:48,174] kopf.objects         [ERROR   ] [*****/dask-cluster-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 591, in daskworkergroup_replica_update
    await worker_deployment.create()
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 232, in create
    async with self.api.call_api(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 134, in call_api
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 759, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '422 Unprocessable Entity' for url 'https://10.32.0.1/apis/apps/v1/namespaces/*****/deployments'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/422
[2024-01-18 14:45:48,196] kopf.objects         [WARNING ] [*****/dask-cluster-default] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskworkergroup_replica_update/spec.worker.replicas': {'started': '2024-01-18T14:45:47.030812', 'stopped': None, 'delayed': '2024-01-18T14:46:48.175474', 'purpose': 'create', 'retries': 1, 'success': False, 'failure': False, 'message': "Client error '422 Unprocessable Entity' for url 'https://10.32.0.1/apis/apps/v1/namespaces/*****/deployments'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/422", 'subrefs': None}}}, None),)
[2024-01-18 14:45:49,647] kopf.activities.prob [INFO    ] Activity 'now' succeeded.
@jacobtomlinson
Member

We use a Deployment to create each worker to help handle cases like evictions. However, that means the only valid restartPolicy is Always.
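
For context, the Kubernetes API server enforces this restriction itself: a Deployment's pod template only accepts restartPolicy: Always, which is why the operator's create call in the logs above fails with 422 Unprocessable Entity. A minimal sketch (hypothetical names) that reproduces the rejection:

---
# Applying this Deployment is rejected by the API server with an error like:
#   Unsupported value: "Never": supported values: "Always"
# because spec.template.spec.restartPolicy must be Always (the default) in a Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: restart-policy-demo        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: restart-policy-demo
  template:
    metadata:
      labels:
        app: restart-policy-demo
    spec:
      restartPolicy: Never         # invalid here; only Always is allowed
      containers:
        - name: worker
          image: ghcr.io/dask/dask:latest   # hypothetical image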

This looks like a duplicate of #855, so I'm going to close this out. If I'm wrong and this is different, please comment here and I'll reopen it.

@jacobtomlinson closed this as not planned (duplicate) on Jan 22, 2024
@briceruzand
Author

It looks like in #855 the scale-down never succeeded because the pods kept restarting.
I thought I had to set restartPolicy: Never, as in this doc: https://github.com/dask/dask-kubernetes/blob/main/doc/source/kubecluster_migrating.rst?plain=1#L145
Thx
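
Worth noting: the example at that line of the migration doc is a bare Pod spec, and a standalone Pod does accept restartPolicy: Never or OnFailure; it is only a Deployment's pod template that is restricted to Always. A minimal sketch of the valid standalone form (hypothetical names):

---
# Valid on its own: a bare Pod may use Always, OnFailure, or Never.
# The same restartPolicy inside a Deployment's pod template would be rejected.
apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-pod-demo    # hypothetical name
spec:
  restartPolicy: Never             # valid for a standalone Pod
  containers:
    - name: worker
      image: ghcr.io/dask/dask:latest   # hypothetical image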
