-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add scaleup step size feature #132
Conversation
Co-authored-by: Junichiro Takagi <[email protected]>
Co-authored-by: Junichiro Takagi <[email protected]>
@tjun Thanks for corrections! |
@@ -108,6 +108,14 @@ type ScaleConfig struct { | |||
// The cool down period between two consecutive scaledown operations. If this option is omitted, the value of the `--scale-down-interval` command line option is taken as the default value. | |||
ScaledownInterval *metav1.Duration `json:"scaledownInterval,omitempty"` | |||
|
|||
// The maximum number of processing units which can be added in one scale-up operation. It can be a multiple of 100 for values < 1000, or a multiple of 1000 otherwise. | |||
// +kubebuilder:default=2000 | |||
ScaleupStepSize int `json:"scaleupStepSize,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For compatibility, could you please make the default value zero and allow unlimited scaling?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it to zero and if scaleupStepSize = 0, I made no scale up size limitation.
https://github.com/mercari/spanner-autoscaler/pull/132/files#diff-1984e42ef0a0d5e7edf140b3ae6decf3fd6e1b9bd4ee351d4dcd8cf7213bb537R569-R577
|
||
// How often autoscaler is reevaluated for scale up. | ||
// The warm up period between two consecutive scaleup operations. If this option is omitted, the value of the `--scale-up-interval` command line option is taken as the default value. | ||
ScaleupInterval *metav1.Duration `json:"scaleupInterval,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as ScaleupStepSize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed it to 10 seconds because default interval for scaleup was set to 10 seconds before introducing scaleup feature.
https://github.com/mercari/spanner-autoscaler/pull/132/files#diff-1984e42ef0a0d5e7edf140b3ae6decf3fd6e1b9bd4ee351d4dcd8cf7213bb537L488
cmd/main.go
Outdated
@@ -60,6 +60,7 @@ var ( | |||
enableLeaderElection = flag.Bool("leader-elect", false, "Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.") | |||
leaderElectionID = flag.String("leader-elect-id", "", "Lease name for leader election.") | |||
scaleDownInterval = flag.Duration("scale-down-interval", 55*time.Minute, "The scale down interval.") | |||
scaleUpInterval = flag.Duration("scale-up-interval", 55*time.Minute, "The scale up interval.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as the comment of ScaleupStepSize #132 (comment)
Co-authored-by: tkuchiki <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the following manifests. The behavior seems to be fine.
Set scalupStepSize and scaleupInterval
apiVersion: spanner.mercari.com/v1beta1
kind: SpannerAutoscaler
metadata:
labels:
app.kubernetes.io/name: spannerautoscaler
app.kubernetes.io/instance: test-instance
app.kubernetes.io/part-of: spanner-autoscaler
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/created-by: spanner-autoscaler
name: spannerautoscaler-sample-beta
namespace: test
spec:
targetInstance:
projectId: test-project
instanceId: test-instance
authentication:
iamKeySecret:
namespace: test
name: spanner-autoscaler-gcp-sa
key: service-account
scaleConfig:
processingUnits:
min: 100
max: 1000
targetCPUUtilization:
highPriority: 30
scaledownStepSize: 200
scaledownInterval: 3m
scaleupStepSize: 100
scaleupInterval: 3m

Set scaleupInterval
apiVersion: spanner.mercari.com/v1beta1
kind: SpannerAutoscaler
metadata:
labels:
app.kubernetes.io/name: spannerautoscaler
app.kubernetes.io/instance: test-instance
app.kubernetes.io/part-of: spanner-autoscaler
app.kubernetes.io/managed-by: kustomize
app.kubernetes.io/created-by: spanner-autoscaler
name: spannerautoscaler-sample-beta
namespace: test
spec:
targetInstance:
projectId: test-project
instanceId: test-instance
authentication:
iamKeySecret:
namespace: test
name: spanner-autoscaler-gcp-sa
key: service-account
scaleConfig:
processingUnits:
min: 100
max: 10000
targetCPUUtilization:
highPriority: 30
scaledownStepSize: 1000
scaledownInterval: 3m
scaleupInterval: 3m

Co-authored-by: tkuchiki <[email protected]>
@tkuchiki Thank you for reviewing. I checked your comments and suggestions and responded those. |
Co-authored-by: Junichiro Takagi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -485,7 +500,7 @@ func (r *SpannerAutoscalerReconciler) needUpdateProcessingUnits(log logr.Logger, | |||
log.Info("no need to scale", "currentPU", currentProcessingUnits, "currentCPU", sa.Status.CurrentHighPriorityCPUUtilization) | |||
return false | |||
|
|||
case desiredProcessingUnits > currentProcessingUnits && r.clock.Now().Before(sa.Status.LastScaleTime.Time.Add(10*time.Second)): | |||
case desiredProcessingUnits > currentProcessingUnits && r.clock.Now().Before(sa.Status.LastScaleTime.Time.Add(getOrConvertTimeDuration(sa.Spec.ScaleConfig.ScaleupInterval, r.scaleUpInterval))): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the scaleUpInterval
is zero, then I think it might try to update Spanner instance back to back without allowing any time for Google Cloud to actually perform the scale-up operation. Did it work fine during real scale-up testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rustycl0ck
I think it's better to set default scaleupInterval as 10 seconds because default interval was 10 seconds before this PR. How do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting default scaleUpInterval
as less than 60 doesn't work anything because reconcile interval is 60 seconds. So I set the default value as 60 seconds.
if suStepSize < 1000 && sa.Status.CurrentProcessingUnits+suStepSize > 1000 { | ||
suStepSize = 1000 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For suStepSize == 0
(which is the default), suStepSize
will always be set to 1000
if the current size of spanner instance is more than 1000 PUs. Is this expected?
@@ -146,7 +148,7 @@ var _ = Describe("Check Update Nodes", func() { | |||
Expect(got).To(BeFalse()) | |||
}) | |||
|
|||
It("needs to scale up nodes because cooldown interval is only applied to scale down operations", func() { | |||
It("does not need to scale up nodes because enough time has not elapsed since last update", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case should have 4 condition checks:
- scales down when outside
scaleDownInterval
- does not scale down when inside
scaleDownInterval
- scales up when outside
scaleUpInterval
- does not scale up when inside
scaleUpInterval
Current PR reduces it to only check conditions number 2 and 4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for reviewing.
I fixed it in a85f59e
Entry("should not scale", 25, 200, 30, 100, 1000, 2000, 2000, 200), | ||
Entry("should scale up 1", 50, 300, 30, 100, 1000, 2000, 2000, 600), | ||
Entry("should scale up 2", 50, 1000, 30, 1000, 10000, 2000, 2000, 2000), | ||
Entry("should scale up 3", 50, 900, 40, 100, 5000, 2000, 2000, 2000), | ||
Entry("should scale down 1", 30, 500, 50, 100, 10000, 2000, 2000, 400), | ||
Entry("should scale down 2", 30, 5000, 50, 1000, 10000, 2000, 2000, 4000), | ||
Entry("should scale down 3", 25, 1000, 65, 300, 10000, 800, 2000, 400), | ||
Entry("should scale down 4", 25, 800, 65, 300, 10000, 800, 2000, 400), | ||
Entry("should scale down 5", 25, 700, 65, 300, 10000, 800, 2000, 300), | ||
Entry("should scale up to max PUs 1", 50, 300, 30, 100, 400, 2000, 2000, 400), | ||
Entry("should scale up to max PUs 2", 50, 3000, 30, 1000, 4000, 2000, 2000, 4000), | ||
Entry("should scale down to min PUs 1", 0, 500, 50, 100, 1000, 2000, 2000, 100), | ||
Entry("should scale down to min PUs 2", 0, 5000, 50, 1000, 10000, 5000, 2000, 1000), | ||
Entry("should scale down to min PUs 3", 0, 5000, 50, 100, 10000, 5000, 2000, 100), | ||
Entry("should scale down with ScaledownStepSize 1", 30, 10000, 50, 5000, 10000, 2000, 2000, 8000), | ||
Entry("should scale down with ScaledownStepSize 2", 30, 10000, 50, 5000, 12000, 200, 2000, 9000), | ||
Entry("should scale down with ScaledownStepSize 3", 30, 10000, 50, 5000, 12000, 100, 2000, 9000), | ||
Entry("should scale down with ScaledownStepSize 4", 30, 10000, 50, 5000, 12000, 900, 2000, 9000), | ||
Entry("should scale down with ScaledownStepSize 5", 25, 2000, 65, 300, 10000, 500, 2000, 1000), | ||
Entry("should scale down with ScaledownStepSize 6", 25, 2000, 65, 300, 10000, 800, 2000, 1000), | ||
Entry("should scale down with ScaledownStepSize 7", 25, 1000, 65, 300, 10000, 500, 2000, 500), | ||
Entry("should scale down with ScaledownStepSize 8", 20, 800, 75, 300, 10000, 200, 2000, 600), | ||
Entry("should scale down with ScaledownStepSize 9", 25, 2000, 50, 300, 10000, 500, 2000, 2000), | ||
Entry("should scale down with ScaledownStepSize 10", 25, 2000, 50, 300, 10000, 1000, 2000, 2000), | ||
Entry("should scale down with ScaledownStepSize 11", 25, 2000, 70, 300, 10000, 400, 2000, 1000), | ||
Entry("should scale down with ScaledownStepSize 12", 25, 1000, 70, 300, 10000, 400, 2000, 600), | ||
Entry("should scale down with ScaledownStepSize 13", 20, 2000, 50, 300, 10000, 200, 2000, 1000), | ||
Entry("should scale down with ScaledownStepSize 14", 20, 2000, 75, 300, 10000, 200, 2000, 1000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In these old test cases, the default value of scaleupStepSize
has been set to 2000
, but the actual default value is 0
. So when this value is not being tested for any effect, it should be set to the default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed this in c4ef133
var _ = Describe("Get and overwrite scaleup interval", func() { | ||
var testReconciler *SpannerAutoscalerReconciler | ||
controllerScaleUpInterval := 55 * time.Minute | ||
|
||
BeforeEach(func() { | ||
By("Creating a test reconciler") | ||
testReconciler = &SpannerAutoscalerReconciler{ | ||
scaleUpInterval: controllerScaleUpInterval, | ||
clock: testingclock.NewFakeClock(fakeTime), | ||
log: logr.Discard(), | ||
} | ||
}) | ||
|
||
It("should get controller default scaleup interval", func() { | ||
want := controllerScaleUpInterval | ||
sa := &spannerv1beta1.SpannerAutoscaler{ | ||
Status: spannerv1beta1.SpannerAutoscalerStatus{ | ||
LastScaleTime: metav1.Time{Time: fakeTime.Add(-time.Minute)}, | ||
CurrentProcessingUnits: 1000, | ||
DesiredProcessingUnits: 2000, | ||
InstanceState: spannerv1beta1.InstanceStateReady, | ||
}, | ||
Spec: spannerv1beta1.SpannerAutoscalerSpec{ | ||
ScaleConfig: spannerv1beta1.ScaleConfig{}, | ||
}, | ||
} | ||
got := getOrConvertTimeDuration(sa.Spec.ScaleConfig.ScaleupInterval, testReconciler.scaleUpInterval) | ||
Expect(got).To(Equal(want)) | ||
}) | ||
|
||
It("should override default scaleup interval with custom SpannerAutoscaler configuration value", func() { | ||
want := 20 * time.Minute | ||
scaleupInterval := metav1.Duration{ | ||
Duration: want, | ||
} | ||
sa := &spannerv1beta1.SpannerAutoscaler{ | ||
Status: spannerv1beta1.SpannerAutoscalerStatus{ | ||
LastScaleTime: metav1.Time{Time: fakeTime.Add(-time.Minute)}, | ||
CurrentProcessingUnits: 2000, | ||
DesiredProcessingUnits: 1000, | ||
InstanceState: spannerv1beta1.InstanceStateReady, | ||
}, | ||
Spec: spannerv1beta1.SpannerAutoscalerSpec{ | ||
ScaleConfig: spannerv1beta1.ScaleConfig{ | ||
ScaleupInterval: &scaleupInterval, | ||
}, | ||
}, | ||
} | ||
|
||
got := getOrConvertTimeDuration(sa.Spec.ScaleConfig.ScaleupInterval, testReconciler.scaleUpInterval) | ||
Expect(got).To(Equal(want)) | ||
}) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test case just seems to be checking the getOrConvertTieDuration
function response. If so, maybe we can have a simpler test case just for that function, which doesn't need any reconciler or SpannerAutoscaler resources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made it simpler in b453448
Co-authored-by: Ravi <[email protected]>
Co-authored-by: Ravi <[email protected]>
da07818
to
b453448
Compare
@tkuchiki @tjun @rustycl0ck I responded all reviews you all requested. Please review it again 🙇♂️ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except the currentCPU
values above 100 in all test cases, rest all looks ok.
Entry("should scale up with ScaleupStepSize when currentPU is equal to zero", 150, 100, 10, 100, 10000, 2000, 0, 2000), | ||
Entry("should scale up with ScaleupStepSize when currentPU is lower than 1000", 100, 100, 10, 100, 10000, 2000, 700, 800), | ||
Entry("should scale up with ScaleupStepSize when currentPU is lower than 1000", 100, 300, 10, 100, 10000, 2000, 700, 1000), | ||
Entry("should scale up with ScaleupStepSize when currentPU is lower than 1000", 100, 800, 10, 100, 10000, 2000, 700, 2000), | ||
Entry("should scale up with ScaleupStepSize when currentPU is lower than 1000", 300, 800, 100, 100, 10000, 2000, 2000, 3000), | ||
Entry("should scale up with ScaleupStepSize when currentPU is equal to 1000", 300, 1000, 100, 100, 10000, 2000, 700, 2000), | ||
Entry("should scale up with ScaleupStepSize when currentPU is equal to 1000", 400, 1000, 100, 100, 10000, 2000, 3000, 4000), | ||
Entry("should scale up with ScaleupStepSize when currentPU is equal to 1000", 400, 1000, 100, 100, 10000, 2000, 4000, 5000), | ||
Entry("should scale up with ScaleupStepSize when currentPU is more than 1000", 200, 2000, 100, 100, 10000, 2000, 700, 3000), | ||
Entry("should scale up with ScaleupStepSize when currentPU is more than 1000", 200, 2000, 100, 100, 10000, 2000, 2000, 4000), | ||
Entry("should scale up with ScaleupStepSize when currentPU is more than 1000", 200, 2000, 100, 100, 10000, 2000, 3000, 5000), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- In all of these test cases the
currentCPU
value is 100, 150, 200, 300 or 400. But real CPU usage will probably be capped at 100%. It would be better to use realistic CPU usage values between 0-100% for these test cases to be useful. - [nit] Since multiple entries have same name, if a test case actually fails, then it is not possible to know which values caused the failure. So either make the test case name string unique with more description, or add iterator numbers to differentiate between identical test case names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed currentCPU
value in 2117743
Co-authored-by: Ravi <[email protected]>
What this PR does
scaleupStepSize
andscaleUpInterval
field to control scale up featurescaleUpStepSize
: The minimum value by which the scaling will increase in one step.scaleUpInterval
: The time interval at which the scaling up will occur.Why we need it
In a scaling up situation, the rapid growth of processing units (PU) can cause an increase in latency for Spanner by rearrangement of splits. To prevent this issue, we propose a controlling method for scaling up the PUs.
Which issue(s) this PR fixes