Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add scaleup step size feature #132

Merged
merged 30 commits into from
Jan 9, 2024
Merged

Conversation

kazuki-hanai
Copy link
Contributor

@kazuki-hanai kazuki-hanai commented Nov 2, 2023

What this PR does

  • Add scaleupStepSize and scaleUpInterval field to control scale up feature
    • scaleUpStepSize: The minimum value by which the scaling will increase in one step.
    • scaleUpInterval: The time interval at which the scaling up will occur.

Why we need it

In a scaling up situation, the rapid growth of processing units (PU) can cause an increase in latency for Spanner by rearrangement of splits. To prevent this issue, we propose a controlling method for scaling up the PUs.

Which issue(s) this PR fixes

@github-actions github-actions bot added the size/M label Nov 2, 2023
@github-actions github-actions bot added size/L and removed size/M labels Nov 9, 2023
@kazuki-hanai kazuki-hanai marked this pull request as ready for review November 9, 2023 08:52
@kazuki-hanai kazuki-hanai requested a review from a team as a code owner November 9, 2023 08:52
@kazuki-hanai kazuki-hanai changed the title [WIP] Add scaleup function Add scaleup function Nov 10, 2023
kazuki-hanai and others added 2 commits November 10, 2023 15:59
@kazuki-hanai
Copy link
Contributor Author

@tjun Thanks for corrections!

@@ -108,6 +108,14 @@ type ScaleConfig struct {
// The cool down period between two consecutive scaledown operations. If this option is omitted, the value of the `--scale-down-interval` command line option is taken as the default value.
ScaledownInterval *metav1.Duration `json:"scaledownInterval,omitempty"`

// The maximum number of processing units which can be added in one scale-up operation. It can be a multiple of 100 for values < 1000, or a multiple of 1000 otherwise.
// +kubebuilder:default=2000
ScaleupStepSize int `json:"scaleupStepSize,omitempty"`
Copy link
Contributor

@tkuchiki tkuchiki Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For compatibility, could you please make the default value zero and allow unlimited scaling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// How often autoscaler is reevaluated for scale up.
// The warm up period between two consecutive scaleup operations. If this option is omitted, the value of the `--scale-up-interval` command line option is taken as the default value.
ScaleupInterval *metav1.Duration `json:"scaleupInterval,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as ScaleupStepSize

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it to 10 seconds because default interval for scaleup was set to 10 seconds before introducing scaleup feature.
https://github.com/mercari/spanner-autoscaler/pull/132/files#diff-1984e42ef0a0d5e7edf140b3ae6decf3fd6e1b9bd4ee351d4dcd8cf7213bb537L488

cmd/main.go Outdated
@@ -60,6 +60,7 @@ var (
enableLeaderElection = flag.Bool("leader-elect", false, "Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.")
leaderElectionID = flag.String("leader-elect-id", "", "Lease name for leader election.")
scaleDownInterval = flag.Duration("scale-down-interval", 55*time.Minute, "The scale down interval.")
scaleUpInterval = flag.Duration("scale-up-interval", 55*time.Minute, "The scale up interval.")
Copy link
Contributor

@tkuchiki tkuchiki Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the comment of ScaleupStepSize #132 (comment)

Co-authored-by: tkuchiki <[email protected]>
Copy link
Contributor

@tkuchiki tkuchiki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the following manifests. The behavior seems to be fine.

Set scalupStepSize and scaleupInterval

apiVersion: spanner.mercari.com/v1beta1
kind: SpannerAutoscaler
metadata:
  labels:
    app.kubernetes.io/name: spannerautoscaler
    app.kubernetes.io/instance: test-instance
    app.kubernetes.io/part-of: spanner-autoscaler
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: spanner-autoscaler
  name: spannerautoscaler-sample-beta
  namespace: test
spec:
  targetInstance:
    projectId: test-project
    instanceId: test-instance
  authentication:
    iamKeySecret:
      namespace: test
      name: spanner-autoscaler-gcp-sa
      key: service-account
  scaleConfig:
    processingUnits:
      min: 100
      max: 1000
    targetCPUUtilization:
      highPriority: 30
    scaledownStepSize: 200
    scaledownInterval: 3m
    scaleupStepSize: 100
    scaleupInterval: 3m
image

Set scaleupInterval

apiVersion: spanner.mercari.com/v1beta1
kind: SpannerAutoscaler
metadata:
  labels:
    app.kubernetes.io/name: spannerautoscaler
    app.kubernetes.io/instance: test-instance
    app.kubernetes.io/part-of: spanner-autoscaler
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/created-by: spanner-autoscaler
  name: spannerautoscaler-sample-beta
  namespace: test
spec:
  targetInstance:
    projectId: test-project
    instanceId: test-instance
  authentication:
    iamKeySecret:
      namespace: test
      name: spanner-autoscaler-gcp-sa
      key: service-account
  scaleConfig:
    processingUnits:
      min: 100
      max: 10000
    targetCPUUtilization:
      highPriority: 30
    scaledownStepSize: 1000
    scaledownInterval: 3m
    scaleupInterval: 3m
image

Co-authored-by: tkuchiki <[email protected]>
@kazuki-hanai
Copy link
Contributor Author

@tkuchiki Thank you for reviewing. I checked your comments and suggestions and responded those.

@tkuchiki tkuchiki requested a review from rustycl0ck December 6, 2023 03:25
tkuchiki
tkuchiki previously approved these changes Dec 7, 2023
Copy link
Contributor

@tkuchiki tkuchiki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -485,7 +500,7 @@ func (r *SpannerAutoscalerReconciler) needUpdateProcessingUnits(log logr.Logger,
log.Info("no need to scale", "currentPU", currentProcessingUnits, "currentCPU", sa.Status.CurrentHighPriorityCPUUtilization)
return false

case desiredProcessingUnits > currentProcessingUnits && r.clock.Now().Before(sa.Status.LastScaleTime.Time.Add(10*time.Second)):
case desiredProcessingUnits > currentProcessingUnits && r.clock.Now().Before(sa.Status.LastScaleTime.Time.Add(getOrConvertTimeDuration(sa.Spec.ScaleConfig.ScaleupInterval, r.scaleUpInterval))):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the scaleUpInterval is zero, then I think it might try to update Spanner instance back to back without allowing any time for Google Cloud to actually perform the scale-up operation. Did it work fine during real scale-up testing?

Copy link
Contributor Author

@kazuki-hanai kazuki-hanai Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rustycl0ck
I think it's better to set default scaleupInterval as 10 seconds because default interval was 10 seconds before this PR. How do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting default scaleUpInterval as less than 60 doesn't work anything because reconcile interval is 60 seconds. So I set the default value as 60 seconds.

Comment on lines 560 to 562
if suStepSize < 1000 && sa.Status.CurrentProcessingUnits+suStepSize > 1000 {
suStepSize = 1000
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For suStepSize == 0 (which is the default), suStepSize will always be set to 1000 if the current size of spanner instance is more than 1000 PUs. Is this expected?

@@ -146,7 +148,7 @@ var _ = Describe("Check Update Nodes", func() {
Expect(got).To(BeFalse())
})

It("needs to scale up nodes because cooldown interval is only applied to scale down operations", func() {
It("does not need to scale up nodes because enough time has not elapsed since last update", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case should have 4 condition checks:

  1. scales down when outside scaleDownInterval
  2. does not scale down when inside scaleDownInterval
  3. scales up when outside scaleUpInterval
  4. does not scale up when inside scaleUpInterval

Current PR reduces it to only check conditions number 2 and 4

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing.
I fixed it in a85f59e

Comment on lines 190 to 217
Entry("should not scale", 25, 200, 30, 100, 1000, 2000, 2000, 200),
Entry("should scale up 1", 50, 300, 30, 100, 1000, 2000, 2000, 600),
Entry("should scale up 2", 50, 1000, 30, 1000, 10000, 2000, 2000, 2000),
Entry("should scale up 3", 50, 900, 40, 100, 5000, 2000, 2000, 2000),
Entry("should scale down 1", 30, 500, 50, 100, 10000, 2000, 2000, 400),
Entry("should scale down 2", 30, 5000, 50, 1000, 10000, 2000, 2000, 4000),
Entry("should scale down 3", 25, 1000, 65, 300, 10000, 800, 2000, 400),
Entry("should scale down 4", 25, 800, 65, 300, 10000, 800, 2000, 400),
Entry("should scale down 5", 25, 700, 65, 300, 10000, 800, 2000, 300),
Entry("should scale up to max PUs 1", 50, 300, 30, 100, 400, 2000, 2000, 400),
Entry("should scale up to max PUs 2", 50, 3000, 30, 1000, 4000, 2000, 2000, 4000),
Entry("should scale down to min PUs 1", 0, 500, 50, 100, 1000, 2000, 2000, 100),
Entry("should scale down to min PUs 2", 0, 5000, 50, 1000, 10000, 5000, 2000, 1000),
Entry("should scale down to min PUs 3", 0, 5000, 50, 100, 10000, 5000, 2000, 100),
Entry("should scale down with ScaledownStepSize 1", 30, 10000, 50, 5000, 10000, 2000, 2000, 8000),
Entry("should scale down with ScaledownStepSize 2", 30, 10000, 50, 5000, 12000, 200, 2000, 9000),
Entry("should scale down with ScaledownStepSize 3", 30, 10000, 50, 5000, 12000, 100, 2000, 9000),
Entry("should scale down with ScaledownStepSize 4", 30, 10000, 50, 5000, 12000, 900, 2000, 9000),
Entry("should scale down with ScaledownStepSize 5", 25, 2000, 65, 300, 10000, 500, 2000, 1000),
Entry("should scale down with ScaledownStepSize 6", 25, 2000, 65, 300, 10000, 800, 2000, 1000),
Entry("should scale down with ScaledownStepSize 7", 25, 1000, 65, 300, 10000, 500, 2000, 500),
Entry("should scale down with ScaledownStepSize 8", 20, 800, 75, 300, 10000, 200, 2000, 600),
Entry("should scale down with ScaledownStepSize 9", 25, 2000, 50, 300, 10000, 500, 2000, 2000),
Entry("should scale down with ScaledownStepSize 10", 25, 2000, 50, 300, 10000, 1000, 2000, 2000),
Entry("should scale down with ScaledownStepSize 11", 25, 2000, 70, 300, 10000, 400, 2000, 1000),
Entry("should scale down with ScaledownStepSize 12", 25, 1000, 70, 300, 10000, 400, 2000, 600),
Entry("should scale down with ScaledownStepSize 13", 20, 2000, 50, 300, 10000, 200, 2000, 1000),
Entry("should scale down with ScaledownStepSize 14", 20, 2000, 75, 300, 10000, 200, 2000, 1000),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In these old test cases, the default value of scaleupStepSize has been set to 2000, but the actual default value is 0. So when this value is not being tested for any effect, it should be set to the default value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed this in c4ef133

Comment on lines 429 to 481
var _ = Describe("Get and overwrite scaleup interval", func() {
var testReconciler *SpannerAutoscalerReconciler
controllerScaleUpInterval := 55 * time.Minute

BeforeEach(func() {
By("Creating a test reconciler")
testReconciler = &SpannerAutoscalerReconciler{
scaleUpInterval: controllerScaleUpInterval,
clock: testingclock.NewFakeClock(fakeTime),
log: logr.Discard(),
}
})

It("should get controller default scaleup interval", func() {
want := controllerScaleUpInterval
sa := &spannerv1beta1.SpannerAutoscaler{
Status: spannerv1beta1.SpannerAutoscalerStatus{
LastScaleTime: metav1.Time{Time: fakeTime.Add(-time.Minute)},
CurrentProcessingUnits: 1000,
DesiredProcessingUnits: 2000,
InstanceState: spannerv1beta1.InstanceStateReady,
},
Spec: spannerv1beta1.SpannerAutoscalerSpec{
ScaleConfig: spannerv1beta1.ScaleConfig{},
},
}
got := getOrConvertTimeDuration(sa.Spec.ScaleConfig.ScaleupInterval, testReconciler.scaleUpInterval)
Expect(got).To(Equal(want))
})

It("should override default scaleup interval with custom SpannerAutoscaler configuration value", func() {
want := 20 * time.Minute
scaleupInterval := metav1.Duration{
Duration: want,
}
sa := &spannerv1beta1.SpannerAutoscaler{
Status: spannerv1beta1.SpannerAutoscalerStatus{
LastScaleTime: metav1.Time{Time: fakeTime.Add(-time.Minute)},
CurrentProcessingUnits: 2000,
DesiredProcessingUnits: 1000,
InstanceState: spannerv1beta1.InstanceStateReady,
},
Spec: spannerv1beta1.SpannerAutoscalerSpec{
ScaleConfig: spannerv1beta1.ScaleConfig{
ScaleupInterval: &scaleupInterval,
},
},
}

got := getOrConvertTimeDuration(sa.Spec.ScaleConfig.ScaleupInterval, testReconciler.scaleUpInterval)
Expect(got).To(Equal(want))
})
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test case just seems to be checking the getOrConvertTieDuration function response. If so, maybe we can have a simpler test case just for that function, which doesn't need any reconciler or SpannerAutoscaler resources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it simpler in b453448

@kazuki-hanai
Copy link
Contributor Author

@tkuchiki @tjun @rustycl0ck I responded all reviews you all requested. Please review it again 🙇‍♂️

Copy link
Contributor

@rustycl0ck rustycl0ck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except the currentCPU values above 100 in all test cases, rest all looks ok.

Comment on lines 249 to 259
Entry("should scale up with ScaleupStepSize when currentPU is equal to zero", 150, 100, 10, 100, 10000, 2000, 0, 2000),
Entry("should scale up with ScaleupStepSize when currentPU is lower than 1000", 100, 100, 10, 100, 10000, 2000, 700, 800),
Entry("should scale up with ScaleupStepSize when currentPU is lower than 1000", 100, 300, 10, 100, 10000, 2000, 700, 1000),
Entry("should scale up with ScaleupStepSize when currentPU is lower than 1000", 100, 800, 10, 100, 10000, 2000, 700, 2000),
Entry("should scale up with ScaleupStepSize when currentPU is lower than 1000", 300, 800, 100, 100, 10000, 2000, 2000, 3000),
Entry("should scale up with ScaleupStepSize when currentPU is equal to 1000", 300, 1000, 100, 100, 10000, 2000, 700, 2000),
Entry("should scale up with ScaleupStepSize when currentPU is equal to 1000", 400, 1000, 100, 100, 10000, 2000, 3000, 4000),
Entry("should scale up with ScaleupStepSize when currentPU is equal to 1000", 400, 1000, 100, 100, 10000, 2000, 4000, 5000),
Entry("should scale up with ScaleupStepSize when currentPU is more than 1000", 200, 2000, 100, 100, 10000, 2000, 700, 3000),
Entry("should scale up with ScaleupStepSize when currentPU is more than 1000", 200, 2000, 100, 100, 10000, 2000, 2000, 4000),
Entry("should scale up with ScaleupStepSize when currentPU is more than 1000", 200, 2000, 100, 100, 10000, 2000, 3000, 5000),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. In all of these test cases the currentCPU value is 100, 150, 200, 300 or 400. But real CPU usage will probably be capped at 100%. It would be better to use realistic CPU usage values between 0-100% for these test cases to be useful.
  2. [nit] Since multiple entries have same name, if a test case actually fails, then it is not possible to know which values caused the failure. So either make the test case name string unique with more description, or add iterator numbers to differentiate between identical test case names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed currentCPU value in 2117743

@tkuchiki tkuchiki merged commit 82ac424 into mercari:master Jan 9, 2024
7 checks passed
@tkuchiki tkuchiki added the release/minor Indicates minor update to action-release-label label Jan 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release/minor Indicates minor update to action-release-label size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants