
Worker RestartPolicy not settable #856

Closed
briceruzand opened this issue Jan 18, 2024 · 2 comments
Comments

@briceruzand
In order to use autoscaling, and more specifically the scale-down feature, I understand that I need to set the pods' restartPolicy to Never or OnFailure so that my worker pods can shut down normally instead of being restarted automatically.

But when I redeploy my working DaskCluster with restartPolicy: Never added, the workers fail to deploy: the operator fails to update the replicas (see logs below).

---
apiVersion: kubernetes.dask.org/v1
kind: DaskCluster
metadata:
  name: dask-cluster
spec:
  worker:
    replicas: 2
    spec:
      restartPolicy: Never
      ...

---
apiVersion: kubernetes.dask.org/v1
kind: DaskAutoscaler
metadata:
  name: dask-cluster-autoscaler
spec:
  cluster: dask-cluster
  minimum: 2  # minimum number of workers to create
  maximum: 4 # maximum number of workers to create

Environment:

  • dask-kubernetes version: 2024.1.0
  • Kubernetes version: 1.28

Dask Operator logs:

[2024-01-18 14:45:48,174] kopf.objects         [ERROR   ] [*****/dask-cluster-default] Handler 'daskworkergroup_replica_update/spec.worker.replicas' failed with an exception. Will retry.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 276, in execute_handler_once
    result = await invoke_handler(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/execution.py", line 371, in invoke_handler
    result = await invocation.invoke(
  File "/usr/local/lib/python3.10/site-packages/kopf/_core/actions/invocation.py", line 116, in invoke
    result = await fn(**kwargs)  # type: ignore
  File "/usr/local/lib/python3.10/site-packages/dask_kubernetes/operator/controller/controller.py", line 591, in daskworkergroup_replica_update
    await worker_deployment.create()
  File "/usr/local/lib/python3.10/site-packages/kr8s/_objects.py", line 232, in create
    async with self.api.call_api(
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/kr8s/_api.py", line 134, in call_api
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/httpx/_models.py", line 759, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '422 Unprocessable Entity' for url 'https://10.32.0.1/apis/apps/v1/namespaces/*****/deployments'
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/422
[2024-01-18 14:45:48,196] kopf.objects         [WARNING ] [*****/dask-cluster-default] Patching failed with inconsistencies: (('remove', ('status', 'kopf'), {'progress': {'daskworkergroup_replica_update/spec.worker.replicas': {'started': '2024-01-18T14:45:47.030812', 'stopped': None, 'delayed': '2024-01-18T14:46:48.175474', 'purpose': 'create', 'retries': 1, 'success': False, 'failure': False, 'message': "Client error '422 Unprocessable Entity' for url 'https://10.32.0.1/apis/apps/v1/namespaces/*****/deployments'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/422", 'subrefs': None}}}, None),)
[2024-01-18 14:45:49,647] kopf.activities.prob [INFO    ] Activity 'now' succeeded.
@jacobtomlinson
Member

We use a Deployment to create each worker to help handle cases like evictions. However, that means the only valid restartPolicy is Always.
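
For context, the Kubernetes API server enforces this restriction itself: a Deployment's pod template only accepts restartPolicy: Always, which is why the operator's create call in the logs above fails with 422 Unprocessable Entity. A minimal sketch (hypothetical names) that reproduces the rejection:

---
# Applying this Deployment is rejected by the API server with an error like:
#   Unsupported value: "Never": supported values: "Always"
# because spec.template.spec.restartPolicy must be Always (the default) in a Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: restart-policy-demo        # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: restart-policy-demo
  template:
    metadata:
      labels:
        app: restart-policy-demo
    spec:
      restartPolicy: Never         # invalid here; only Always is allowed
      containers:
        - name: worker
          image: ghcr.io/dask/dask:latest   # hypothetical image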

This looks like a duplicate of #855, so I'm going to close this out. If I'm wrong and this is different, please comment here and I'll reopen it.

@jacobtomlinson closed this as not planned (duplicate) on Jan 22, 2024
@briceruzand
Author

It looks like in #855 the scale-down never succeeded because the pods kept restarting.
I thought I had to set restartPolicy: Never, as in this doc: https://github.com/dask/dask-kubernetes/blob/main/doc/source/kubecluster_migrating.rst?plain=1#L145
Thx
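
Worth noting: the example at that line of the migration doc is a bare Pod spec, and a standalone Pod does accept restartPolicy: Never or OnFailure; it is only a Deployment's pod template that is restricted to Always. A minimal sketch of the valid standalone form (hypothetical names):

---
# Valid on its own: a bare Pod may use Always, OnFailure, or Never.
# The same restartPolicy inside a Deployment's pod template would be rejected.
apiVersion: v1
kind: Pod
metadata:
  name: restart-policy-pod-demo    # hypothetical name
spec:
  restartPolicy: Never             # valid for a standalone Pod
  containers:
    - name: worker
      image: ghcr.io/dask/dask:latest   # hypothetical image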
