operator stuck in scale down loop #398

Closed
a2k47l opened this issue Mar 19, 2024 · 5 comments · Fixed by #452


a2k47l commented Mar 19, 2024

Expected Behavior

When CPU load is below scaleDownCPUBoundary, the index replica count should be reduced, and the node count should go down with it.

Actual Behavior

When CPU load is below scaleDownCPUBoundary, the index replica count is not reduced, so the number of nodes does not go down.
Logs:
time="2024-03-19T06:27:48Z" level=info msg="Waiting for operation to stop" eds=es-mci-data namespace=mci
time="2024-03-19T06:27:49Z" level=info msg="Terminating operator loop." eds=es-mci-data namespace=mci
time="2024-03-19T06:27:50Z" level=info msg="Waiting for operation to stop" eds=es-mci-data namespace=mci
time="2024-03-19T06:27:50Z" level=error msg="Failed to operate resource: failed to update status: Put \"https://10.10.0.1:443/apis/zalando.org/v1/namespaces/mci/elasticsearchdatasets/es-mci-data/status?timeout=30s\": context canceled"
time="2024-03-19T06:27:50Z" level=info msg="Terminating operator loop." eds=es-mci-data namespace=mci
time="2024-03-19T06:28:19Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
time="2024-03-19T06:28:49Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci
time="2024-03-19T06:29:19Z" level=info msg="Scaling hint: DOWN" eds=es-mci-data namespace=mci

Steps to Reproduce the Problem

  1. I have a simple setup: 1 ES cluster with 1 master and 1 EDS managed by es-operator, and a single index with 2 shards.

  2. Scaling options:
    enabled: true
    minReplicas: 1
    maxReplicas: 6
    minShardsPerNode: 1
    maxShardsPerNode: 1
    minIndexReplicas: 0
    maxIndexReplicas: 5
    scaleUpCPUBoundary: 50
    scaleUpCooldownSeconds: 60
    scaleUpThresholdDurationSeconds: 30
    scaleDownCPUBoundary: 40
    scaleDownCooldownSeconds: 60
    scaleDownThresholdDurationSeconds: 30
    diskUsagePercentScaledownWatermark: 75

  3. When I start a basic busybox load generator, CPU usage increases and es-operator scales up by increasing the replica count of the index. When I stop the load generator, CPU usage goes down but the replica count is not updated, so the number of nodes stays high.

Specifications

  • Version: ES 8.12.2, es-operator: latest (should be 0.1.4)
  • Platform: Gcloud k8s cluster

a2k47l commented Mar 19, 2024

I have not added any group-based allocation awareness, and the logs are not helping either. Let me know if you need further details about the setup.


a2k47l commented Mar 19, 2024

When I manually reduce the index replica count, es-operator also reduces the StatefulSet replicas. Why is es-operator not able to reduce the index replica count itself? Any ideas?


a2k47l commented Mar 20, 2024

@otrosien I ran the operator from my local machine and was able to pinpoint the code causing the issue. It is this function that prevents the scale-down, i.e. it does not scale down when the shard-to-node ratio is 1 or lower.
    func calculateNodesWithSameShardToNodeRatio(currentDesiredNodeReplicas, currentTotalShards, newTotalShards int32) int32 {
        currentShardToNodeRatio := shardToNodeRatio(currentTotalShards, currentDesiredNodeReplicas)
        if currentShardToNodeRatio <= 1 {
            return currentDesiredNodeReplicas
        }
        return int32(math.Ceil(float64(newTotalShards) / float64(currentShardToNodeRatio)))
    }

I tried changing if currentShardToNodeRatio <= 1 to if currentShardToNodeRatio <= 0, and it scaled down. Is there an obvious case I am missing here? Or is this a bug where the use case described above was not considered? Let me know if this change can be included.

otrosien (Member) commented

I think we never considered min-shards-per-node being equal to max-shards-per-node. Any reason for that kind of setup?


otrosien commented Nov 7, 2024

@a2k47l We have found the issue, and will provide a fix for it in this PR: #452
