-
Notifications
You must be signed in to change notification settings - Fork 269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add/lifecycle heartbeat #1116
Add/lifecycle heartbeat #1116
Conversation
Regarding potential improvements: I think caching describeLifecycleHook is unnecessary because like you said termination events are not that frequent and would add unnecessary complexity. Regarding single heartbeat manager, I think it also adds unnecessary complexity Thanks for doing all the thorough manual testing! Can we also test with windows nodes? |
README.md
Outdated
|
||
### Important Notes | ||
|
||
- A lifecycle hook for instance termination is required for this feature. Longer grace periods are achieved by renewing the heartbeat timeout of the ASG's lifecycle hook. Instances terminate instantly without a hook. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this section is important enough that it could be described in the How to use section? The How to use section currently seems sparse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with David on this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated this part with detailed and extensive explanation
ASG_TERMINATE_EVENT_ONE_LINE=$(echo "${ASG_TERMINATE_EVENT}" | tr -d '\n' |sed 's/\"/\\"/g') | ||
SEND_SQS_CMD="awslocal sqs send-message --queue-url ${queue_url} --message-body \"${ASG_TERMINATE_EVENT_ONE_LINE}\" --region ${AWS_REGION}" | ||
kubectl exec -i "${localstack_pod}" -- bash -c "$SEND_SQS_CMD" | ||
echo "✅ Sent Spot Interruption Event to SQS queue: ${queue_url}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ASG termination event
if [[ $FOUND_HEARTBEAT_END_LOG -eq 0 ]] && kubectl logs -n kube-system "${NTH_POD}" | grep -q "Heartbeat deadline exceeded, stopping heartbeat"; then | ||
FOUND_HEARTBEAT_END_LOG=1 | ||
fi | ||
if [[ $HEARTBEAT_COUNT -eq 3 && $FOUND_HEARTBEAT_END_LOG -eq 1 ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Instead of hardcoding these values, can we abstract the values of HEARTBEAT_INTERVAL and HEARTBEAT_UNTIL to figure out expected number of Heartbeats at the top of the file? Just to make it easier to update this test file if need be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is your opinion about auto-calculating the HEARTBEAT_CHECK_CYCLES and HEARTBEAT_CHECK_SLEEP as well? They also derived from HEARTBEAT_INTERVAL and HEARTBEAT_UNTIL, it might be more consistent to either auto-calculate all related values or none of them. This would make the code easier to understand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah I think that would be helpful thanks
|
||
if [[ $cordoned -eq 1 && $(kubectl get deployments regular-pod-test -o=jsonpath='{.status.unavailableReplicas}') -eq 1 ]]; then | ||
echo "✅ Verified the regular-pod-test pod was evicted!" | ||
echo "✅ ASG Lifecycle SQS Test Passed $CLUSTER_NAME! ✅" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "ASG Lifecycle SQS Test Passed with Heartbeat"
Just to avoid ambiguity with other tests that we run?
heartbeatTimeout := int(*lifecyclehook.LifecycleHooks[0].HeartbeatTimeout) | ||
|
||
if heartbeatInterval >= heartbeatTimeout { | ||
log.Warn().Msgf("Heartbeat interval (%d seconds) is equal to or greater than the heartbeat timeout (%d seconds) for the lifecycle hook %s. The node would likely be terminated before the heartbeat is sent", heartbeatInterval, heartbeatTimeout, *lifecyclehook.LifecycleHooks[0].LifecycleHookName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also add the ASG name (lifecycleDetail.AutoScalingGroupName) in this log warn? Just to help with debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to David's comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some inline comments, good work @hyeong01 !
README.md
Outdated
|
||
### Important Notes | ||
|
||
- A lifecycle hook for instance termination is required for this feature. Longer grace periods are achieved by renewing the heartbeat timeout of the ASG's lifecycle hook. Instances terminate instantly without a hook. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with David on this comment.
heartbeatTimeout := int(*lifecyclehook.LifecycleHooks[0].HeartbeatTimeout) | ||
|
||
if heartbeatInterval >= heartbeatTimeout { | ||
log.Warn().Msgf("Heartbeat interval (%d seconds) is equal to or greater than the heartbeat timeout (%d seconds) for the lifecycle hook %s. The node would likely be terminated before the heartbeat is sent", heartbeatInterval, heartbeatTimeout, *lifecyclehook.LifecycleHooks[0].LifecycleHookName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to David's comment
On top of the changes suggested from the comment, I modified the logging to reduce duplicate logs and added a check that prevents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some comments
README.md
Outdated
- When NTH receives an ASG lifecycle termination event, it starts sending heartbeats to ASG to renew the heartbeat timeout associated with the ASG's termination lifecycle hook. | ||
- The heartbeat timeout acts as a timer that starts when the termination event begins. | ||
- Before the timeout reaches zero, the termination process is halted at the `Terminating:Wait` stage. | ||
- Previously, NTH couldn't issue heartbeats, limiting the maximum time for preventing termination to the maximum heartbeat timeout (7200 seconds). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This wasn't expected behavior of NTH when it was implemented, we shouldn't mention terms like previously in README, we should only concentrate on how the current functionality works...
README.md
Outdated
##### How to use | ||
|
||
- Configure a termination lifecycle hook on ASG (required). Set the heartbeat timeout value to be longer than the `Heartbeat Interval`. Each heartbeat signal resets this timeout, extending the duration that an instance remains in the `Terminating:Wait` state. Without this lifecycle hook, the instance will terminate immediately when termination event occurs. | ||
- Configure `Heartbeat Interval` (required) and `Heartbeat Until` (optional). NTH operates normally without heartbeats if neither value is set. If only the interval is specified, `Heartbeat Until` defaults to 172800 seconds (48 hours) and heartbeats will be sent. Providing both values enables NTH to run with heartbeats. `Heartbeat Until` must be provided with a valid `Heartbeat Interval`, otherwise NTH will fail to start. Any invalid values (wrong type or out of range) will also prevent NTH from starting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Providing both values enables NTH to run with heartbeats.
This statement is confusing, in the previous line you mentioned that if Interval is set then we will send heart beats, again mentioning this immediately makes me think that it is mandatory to set both values, could you please remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Plus 1 to this point
@@ -166,6 +169,8 @@ type Config struct { | |||
CompleteLifecycleActionDelaySeconds int | |||
DeleteSqsMsgIfNodeNotFound bool | |||
UseAPIServerCacheToListPods bool | |||
HeartbeatInterval int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to have some coverage around these newly added configs in config-test.go file...
fail_and_exit 2 | ||
fi | ||
|
||
ASG_TERMINATE_EVENT=$(cat <<EOF |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: ASG_TERMINATION_EVENT
EOF | ||
) | ||
|
||
ASG_TERMINATE_EVENT_ONE_LINE=$(echo "${ASG_TERMINATE_EVENT}" | tr -d '\n' |sed 's/\"/\\"/g') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above TERMINATION
#### Issuing Lifecycle Heartbeats | ||
|
||
You can set NTH to send heartbeats to ASG in Queue Processor mode. This allows for a much longer grace period (up to 48 hours) for termination than the maximum heartbeat timeout of two hours. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a line that explains when this feature would be useful: e.g. When a customer has pods that have long-running drain tasks.
Changes made, thank you for your valuable insight! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
Issue #, if available:
#493 NTH should issue lifecycle heartbeats
Description of changes:
How you tested your changes:
Environment (Linux / Windows): Linux
Kubernetes Version: v1.31.3
Unit testing:
E2E testing:
Manual testing with real ASG and K8s:
Potential Improvements:
Cached API call for getting heartbeat timeout from ASG (
describeLifecycleHook
).Single heartbeat manager (single gorountine for issuing heartbeats)
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.