Add/lifecycle heartbeat #1116

Merged
merged 31 commits into from
Jan 29, 2025

Conversation

@hyeong01 hyeong01 commented Jan 10, 2025

Issue #, if available:
#493 NTH should issue lifecycle heartbeats

Description of changes:

  1. Added the option to issue heartbeats
     • Configurable interval and length
  2. Added heartbeat configuration validity check
  3. Added automatic closure when lifecycle action becomes invalid
  4. Added unit and E2E testing
  5. Added explanation of the feature to readme

How you tested your changes:
Environment (Linux / Windows): Linux
Kubernetes Version: v1.31.3

Unit testing:

  • Checked heartbeat signals and closure due to heartbeat expiration
  • Checked heartbeat signals and closure due to drain completion
  • Checked heartbeat failures due to invalid lifecycle hook target
  • Checked heartbeat failures due to causes other than an invalid lifecycle hook target

E2E testing:

  • E2E tested heartbeat signal and closure due to heartbeat expiration with kind and localstack.

Manual testing with real ASG and K8s:

  • Tested all 32 possible configuration cases
  • Tested interval > timeout case
  • Graceful termination 5->3, 5->4, 10->3, 103->3 (corresponding number of workers for terminating instances) with varying intervals and length for each.
  • Important example 1: # of terminations=103->3, Interval=300, heartbeatUntil=590, timeout=590. The last interruption event entered processInterruptionEvent 223 seconds after the first event was stored in the eventStore. The first heartbeat was sent 90 seconds after the last interruption event entered.
  • Important example 2: # of terminations=103->3, Interval=90, heartbeatUntil=260, timeout=150. The timeout was not enough for the last heartbeat to be sent out.
  • Ungraceful termination 5->3, 10->3, 20->3, 103->3 (smaller number of workers than terminating instances).

Potential Improvements:
Cached API call for getting heartbeat timeout from ASG (describeLifecycleHook).

  • Details: 1 hour cache for retrieving the heartbeat timeout value and warning user if timeout < interval.
  • Pros: Reduced number of API calls.
  • Cons: Added system complexity. The cache would often expire or go unused because termination events happen sparsely.

Single heartbeat manager (single goroutine for issuing heartbeats)

  • Pros: Reduced number of goroutines
  • Cons: Potential burst in API calls. Goroutines are lightweight, and their number does not grow beyond the number of workers (default 10).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hyeong01 hyeong01 requested a review from a team as a code owner January 10, 2025 03:47

Lu-David commented Jan 10, 2025

Regarding potential improvements: I think caching describeLifecycleHook is unnecessary because, like you said, termination events are not that frequent, and it would add unnecessary complexity. Regarding the single heartbeat manager, I think it also adds unnecessary complexity.

Thanks for doing all the thorough manual testing! Can we also test with windows nodes?

README.md Outdated

### Important Notes

- A lifecycle hook for instance termination is required for this feature. Longer grace periods are achieved by renewing the heartbeat timeout of the ASG's lifecycle hook. Instances terminate instantly without a hook.
Contributor

I think this section is important enough that it could be described in the How to use section? The How to use section currently seems sparse.

Contributor

Agree with David on this comment.

Contributor Author

Updated this part with detailed and extensive explanation

ASG_TERMINATE_EVENT_ONE_LINE=$(echo "${ASG_TERMINATE_EVENT}" | tr -d '\n' |sed 's/\"/\\"/g')
SEND_SQS_CMD="awslocal sqs send-message --queue-url ${queue_url} --message-body \"${ASG_TERMINATE_EVENT_ONE_LINE}\" --region ${AWS_REGION}"
kubectl exec -i "${localstack_pod}" -- bash -c "$SEND_SQS_CMD"
echo "✅ Sent Spot Interruption Event to SQS queue: ${queue_url}"
Contributor

nit: ASG termination event

if [[ $FOUND_HEARTBEAT_END_LOG -eq 0 ]] && kubectl logs -n kube-system "${NTH_POD}" | grep -q "Heartbeat deadline exceeded, stopping heartbeat"; then
FOUND_HEARTBEAT_END_LOG=1
fi
if [[ $HEARTBEAT_COUNT -eq 3 && $FOUND_HEARTBEAT_END_LOG -eq 1 ]]; then
Contributor

nit: Instead of hardcoding these values, can we abstract the values of HEARTBEAT_INTERVAL and HEARTBEAT_UNTIL to figure out expected number of Heartbeats at the top of the file? Just to make it easier to update this test file if need be

Contributor Author

@hyeong01 hyeong01 Jan 16, 2025

What is your opinion about auto-calculating HEARTBEAT_CHECK_CYCLES and HEARTBEAT_CHECK_SLEEP as well? They are also derived from HEARTBEAT_INTERVAL and HEARTBEAT_UNTIL, so it might be more consistent to either auto-calculate all related values or none of them. Would this make the code easier to understand?

Contributor

yeah I think that would be helpful thanks
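
For reference, the expected heartbeat count discussed in this thread follows from integer arithmetic on the two settings. The sketch below is written in Go rather than in the test's shell, and `expectedHeartbeats` together with the values 30 and 100 are hypothetical, not taken from the test file. It assumes the first heartbeat fires one full interval after the event is received and that heartbeats stop once `HEARTBEAT_UNTIL` has elapsed.

```go
package main

import "fmt"

// expectedHeartbeats returns how many heartbeats fit before the deadline,
// assuming the first one fires one full interval after the start.
// This is an illustrative helper, not part of the PR.
func expectedHeartbeats(intervalSec, untilSec int) int {
	if intervalSec <= 0 || untilSec < intervalSec {
		return 0
	}
	// Integer division counts the ticks at interval, 2*interval, ...
	// that land on or before untilSec.
	return untilSec / intervalSec
}

func main() {
	// Hypothetical values; the test's real settings are not shown in this thread.
	fmt.Println(expectedHeartbeats(30, 100)) // prints 3
}
```

When the deadline is an exact multiple of the interval, the last tick and the deadline coincide, so a test would probably want to avoid such values.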


if [[ $cordoned -eq 1 && $(kubectl get deployments regular-pod-test -o=jsonpath='{.status.unavailableReplicas}') -eq 1 ]]; then
echo "✅ Verified the regular-pod-test pod was evicted!"
echo "✅ ASG Lifecycle SQS Test Passed $CLUSTER_NAME! ✅"
Contributor

nit: "ASG Lifecycle SQS Test Passed with Heartbeat"

Just to avoid ambiguity with other tests that we run?

heartbeatTimeout := int(*lifecyclehook.LifecycleHooks[0].HeartbeatTimeout)

if heartbeatInterval >= heartbeatTimeout {
log.Warn().Msgf("Heartbeat interval (%d seconds) is equal to or greater than the heartbeat timeout (%d seconds) for the lifecycle hook %s. The node would likely be terminated before the heartbeat is sent", heartbeatInterval, heartbeatTimeout, *lifecyclehook.LifecycleHooks[0].LifecycleHookName)
Contributor

can we also add the ASG name (lifecycleDetail.AutoScalingGroupName) in this log warn? Just to help with debugging?

Contributor

+1 to David's comment
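
A minimal sketch of the suggestion above, assuming the ASG name is available as a plain string at the point of the check. `intervalWarning` and its parameters are hypothetical names, and the sketch returns the message instead of calling the logger so that it stays dependency-free; the exact wording of the message in the merged code may differ.

```go
package main

import "fmt"

// intervalWarning returns a warning message when the heartbeat interval is
// not shorter than the lifecycle hook's heartbeat timeout, and "" otherwise.
// Hypothetical helper for illustration only.
func intervalWarning(intervalSec, timeoutSec int, hookName, asgName string) string {
	if intervalSec < timeoutSec {
		return ""
	}
	return fmt.Sprintf(
		"Heartbeat interval (%d seconds) is equal to or greater than the heartbeat timeout (%d seconds) "+
			"for the lifecycle hook %s attached to ASG %s. "+
			"The node would likely be terminated before the heartbeat is sent",
		intervalSec, timeoutSec, hookName, asgName)
}

func main() {
	// Placeholder hook and ASG names.
	if msg := intervalWarning(300, 150, "my-hook", "my-asg"); msg != "" {
		fmt.Println(msg)
	}
}
```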

Contributor

@LikithaVemulapalli LikithaVemulapalli left a comment

Left some inline comments, good work @hyeong01 !

@hyeong01
Contributor Author

On top of the changes suggested in the comments, I modified the logging to reduce duplicate logs and added a check that prevents heartbeatUntil < heartbeatInterval.

Contributor

@LikithaVemulapalli LikithaVemulapalli left a comment

left some comments

README.md Outdated
- When NTH receives an ASG lifecycle termination event, it starts sending heartbeats to ASG to renew the heartbeat timeout associated with the ASG's termination lifecycle hook.
- The heartbeat timeout acts as a timer that starts when the termination event begins.
- Before the timeout reaches zero, the termination process is halted at the `Terminating:Wait` stage.
- Previously, NTH couldn't issue heartbeats, limiting the maximum time for preventing termination to the maximum heartbeat timeout (7200 seconds).
Contributor

This wasn't expected behavior of NTH when it was implemented. We shouldn't use terms like "previously" in the README; we should concentrate only on how the current functionality works.
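
The heartbeat behavior described in the README excerpt above can be sketched as a ticker loop with a deadline. This is an illustration under stated assumptions, not the PR's implementation: `sendHeartbeats`, `errActionInvalid`, and the injected `send` callback are hypothetical names, and `send` stands in for whatever call renews the lifecycle hook's heartbeat timeout. The loop stops at the deadline, when the stop channel is closed (for example once the drain completes), or when a heartbeat fails, which mirrors the "automatic closure when lifecycle action becomes invalid" item in the PR description.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errActionInvalid is a hypothetical error standing in for a rejected heartbeat.
var errActionInvalid = errors.New("lifecycle action no longer valid")

// sendHeartbeats calls send once per interval until the deadline passes,
// stop is closed, or send returns an error. It returns the number of
// successful heartbeats. interval must be positive (time.NewTicker panics otherwise).
func sendHeartbeats(interval, until time.Duration, send func() error, stop <-chan struct{}) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	deadline := time.After(until)
	sent := 0
	for {
		select {
		case <-stop:
			return sent // e.g. the node finished draining
		case <-deadline:
			return sent // heartbeat deadline exceeded
		case <-ticker.C:
			if err := send(); err != nil {
				return sent // stop once the heartbeat is rejected
			}
			sent++
		}
	}
}

func main() {
	calls := 0
	// Fake sender that fails on the third call.
	send := func() error {
		calls++
		if calls == 3 {
			return errActionInvalid
		}
		return nil
	}
	n := sendHeartbeats(10*time.Millisecond, time.Minute, send, make(chan struct{}))
	fmt.Println("heartbeats sent:", n) // prints "heartbeats sent: 2"
}
```

Injecting `send` keeps the loop testable without any AWS client; one goroutine per in-flight termination event would run such a loop.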

README.md Outdated
##### How to use

- Configure a termination lifecycle hook on ASG (required). Set the heartbeat timeout value to be longer than the `Heartbeat Interval`. Each heartbeat signal resets this timeout, extending the duration that an instance remains in the `Terminating:Wait` state. Without this lifecycle hook, the instance will terminate immediately when termination event occurs.
- Configure `Heartbeat Interval` (required) and `Heartbeat Until` (optional). NTH operates normally without heartbeats if neither value is set. If only the interval is specified, `Heartbeat Until` defaults to 172800 seconds (48 hours) and heartbeats will be sent. Providing both values enables NTH to run with heartbeats. `Heartbeat Until` must be provided with a valid `Heartbeat Interval`, otherwise NTH will fail to start. Any invalid values (wrong type or out of range) will also prevent NTH from starting.
Contributor

The statement "Providing both values enables NTH to run with heartbeats." is confusing. The previous line already says that heartbeats are sent if the interval is set, so repeating this immediately afterwards makes me think it is mandatory to set both values. Could you please remove it?

Contributor

Plus 1 to this point
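
The configuration rules quoted in this thread can be summarized in a small validation sketch. It is an assumption-laden illustration, not the code in `pkg/config/config.go`: `resolveHeartbeat`, `heartbeatSettings`, and the `unset` sentinel are hypothetical, and because the README excerpt does not state the allowed ranges, the sketch only rejects non-positive intervals. It also includes the `heartbeatUntil < heartbeatInterval` check mentioned earlier in the conversation.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	unset                 = -1     // hypothetical sentinel for "value not provided"
	defaultHeartbeatUntil = 172800 // 48 hours, the default stated in the README text
)

// heartbeatSettings is a hypothetical holder for the resolved values.
type heartbeatSettings struct {
	Enabled  bool
	Interval int
	Until    int
}

// resolveHeartbeat applies the rules from the README text: no values means no
// heartbeats, an interval alone gets the default deadline, and a deadline
// without an interval is an error.
func resolveHeartbeat(interval, until int) (heartbeatSettings, error) {
	if interval == unset {
		if until != unset {
			return heartbeatSettings{}, errors.New("heartbeat until requires a heartbeat interval")
		}
		return heartbeatSettings{}, nil // heartbeats disabled
	}
	if interval <= 0 {
		return heartbeatSettings{}, fmt.Errorf("invalid heartbeat interval: %d", interval)
	}
	if until == unset {
		until = defaultHeartbeatUntil
	}
	if until < interval {
		return heartbeatSettings{}, fmt.Errorf("heartbeat until (%d) must not be less than heartbeat interval (%d)", until, interval)
	}
	return heartbeatSettings{Enabled: true, Interval: interval, Until: until}, nil
}

func main() {
	s, err := resolveHeartbeat(300, unset)
	fmt.Println(s, err)
}
```

Returning an error rather than silently disabling heartbeats matches the README's statement that invalid values prevent NTH from starting.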

@@ -166,6 +169,8 @@ type Config struct {
CompleteLifecycleActionDelaySeconds int
DeleteSqsMsgIfNodeNotFound bool
UseAPIServerCacheToListPods bool
HeartbeatInterval int
Contributor

We need to have some coverage around these newly added configs in config-test.go file...

fail_and_exit 2
fi

ASG_TERMINATE_EVENT=$(cat <<EOF
Contributor

nit: ASG_TERMINATION_EVENT

EOF
)

ASG_TERMINATE_EVENT_ONE_LINE=$(echo "${ASG_TERMINATE_EVENT}" | tr -d '\n' |sed 's/\"/\\"/g')
Contributor

same as above TERMINATION

#### Issuing Lifecycle Heartbeats

You can set NTH to send heartbeats to ASG in Queue Processor mode. This allows for a much longer grace period (up to 48 hours) for termination than the maximum heartbeat timeout of two hours.

Contributor

I would add a line that explains when this feature would be useful: e.g. When a customer has pods that have long-running drain tasks.

@hyeong01
Contributor Author

Changes made, thank you for your valuable insight!

Contributor

@LikithaVemulapalli LikithaVemulapalli left a comment

/lgtm

@LikithaVemulapalli LikithaVemulapalli merged commit d483d25 into aws:main Jan 29, 2025
11 checks passed