operator stuck in scale down loop #398

a2k47l · 2024-03-19T06:42:34Z

Expected Behavior

When CPU load is below scaleDownCPUBoundary, the index replica count should be reduced, and the node count should go down as a result.

Actual Behavior

When CPU load is below scaleDownCPUBoundary, the index replica count is not reduced, so the number of nodes does not go down.
Logs:
time="2024-03-19T06:27:48Z" level=info msg="Waiting for operation to stop" eds=es-mci-data namespace=mci
time="2024-03-19T06:27:49Z" level=info msg="Terminating operator loop." eds=es-mci-data namespace=mci
time="2024-03-19T06:27:50Z" level=info msg="Waiting for operation to stop" eds=es-mci-data namespace=mci
time="2024-03-19T06:27:50Z" level=error msg="Failed to operate resource: failed to update status: Put \"https://10.10.0.1:443/apis/zalando.org/v1/namespaces/mci/elasticsearchdatasets/es-mci-data/status?timeout=30s\": context canceled"
time="2024-03-19T06:27:50Z" level=info msg="Terminating operator loop." eds=es-mci-data namespace=mci
time="2024-03-19T06:28:19Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
time="2024-03-19T06:28:49Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
time="2024-03-19T06:29:19Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci

Steps to Reproduce the Problem

I have a simple setup: 1 ES cluster with 1 master and 1 EDS managed by es-operator, and a single index with 2 shards.
Scaling options:
enabled: true
minReplicas: 1
maxReplicas: 6
minShardsPerNode: 1
maxShardsPerNode: 1
minIndexReplicas: 0
maxIndexReplicas: 5
scaleUpCPUBoundary: 50
scaleUpCooldownSeconds: 60
scaleUpThresholdDurationSeconds: 30
scaleDownCPUBoundary: 40
scaleDownCooldownSeconds: 60
scaleDownThresholdDurationSeconds: 30
diskUsagePercentScaledownWatermark: 75
When I start a basic busybox load generator, CPU usage increases and es-operator scales up by increasing the index replica count. But when I stop the load generator, CPU usage goes down and the replica count is not updated, so the number of nodes remains high.

Specifications

Version: ES 8.12.2, es-operator: latest (should be 0.1.4)
Platform: Gcloud k8s cluster

a2k47l · 2024-03-19T06:57:27Z

I have not added any group-based allocation awareness, and the logs are not helping either. Let me know if you need further details about the setup.

a2k47l · 2024-03-19T12:23:48Z

When I manually reduce the index replica count, es-operator also reduces the StatefulSet replicas. Why is es-operator not able to reduce the index replica count itself? Any ideas?

a2k47l · 2024-03-20T08:51:24Z

@otrosien I ran the operator from my local machine and was able to pinpoint the code causing the issue. It's this function that prevents the scale-down, i.e. it does not scale down when the shard-to-node ratio is 1.
func calculateNodesWithSameShardToNodeRatio(currentDesiredNodeReplicas, currentTotalShards, newTotalShards int32) int32 {
	currentShardToNodeRatio := shardToNodeRatio(currentTotalShards, currentDesiredNodeReplicas)
	if currentShardToNodeRatio <= 1 {
		return currentDesiredNodeReplicas
	}
	return int32(math.Ceil(float64(newTotalShards) / float64(currentShardToNodeRatio)))
}

I tried changing if currentShardToNodeRatio <= 1 to if currentShardToNodeRatio <= 0, and it scaled down. Is there an obvious case I am missing here? Or is this a bug where the use case described above was not considered? Let me know if this change can be included.
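
To make the effect of the guard concrete, here is a minimal standalone sketch of the two variants. It is not the operator's actual code: shardToNodeRatio is assumed here to be a plain floating-point division, the function names nodesOriginalGuard and nodesRelaxedGuard are made up for illustration, and the numbers (6 nodes holding 6 shards, dropping to 4 shards after removing one index replica) are an assumed scale-down step for a 2-shard index with maxShardsPerNode: 1. It says nothing about how the eventual fix was implemented.

```go
package main

import (
	"fmt"
	"math"
)

// shardToNodeRatio is an ASSUMED implementation (simple division);
// the real helper in es-operator may differ.
func shardToNodeRatio(shards, nodes int32) float64 {
	return float64(shards) / float64(nodes)
}

// nodesOriginalGuard mirrors the function quoted above: with a ratio
// of 1 or less it keeps the current node count unchanged.
func nodesOriginalGuard(currentNodes, currentTotalShards, newTotalShards int32) int32 {
	ratio := shardToNodeRatio(currentTotalShards, currentNodes)
	if ratio <= 1 {
		return currentNodes
	}
	return int32(math.Ceil(float64(newTotalShards) / ratio))
}

// nodesRelaxedGuard is the experimental variant described above
// (guard changed to <= 0), shown only for comparison.
func nodesRelaxedGuard(currentNodes, currentTotalShards, newTotalShards int32) int32 {
	ratio := shardToNodeRatio(currentTotalShards, currentNodes)
	if ratio <= 0 {
		return currentNodes
	}
	return int32(math.Ceil(float64(newTotalShards) / ratio))
}

func main() {
	// Assumed state: 2 primary shards + 2 replicas each = 6 shards on 6 nodes.
	// Removing one replica would leave 2 * (1 + 1) = 4 shards.
	fmt.Println("original guard:", nodesOriginalGuard(6, 6, 4))
	fmt.Println("relaxed guard: ", nodesRelaxedGuard(6, 6, 4))
}
```

Under these assumptions the original guard keeps all 6 nodes, while the relaxed guard asks for 4, which matches the behaviour reported in this issue.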

otrosien · 2024-03-20T22:44:56Z

I think we never considered min-shards-per-node being equal to max-shards-per-node. Any reason for that kind of setup?

otrosien · 2024-11-07T12:46:26Z

@a2k47l We have found the issue, and will provide a fix for it in this PR: #452

otrosien mentioned this issue Nov 7, 2024

Bugfix: Allow scale-down when at one-shard-per-node #452

Merged

otrosien closed this as completed in #452 Nov 7, 2024
