[8.x](backport #6568) Fix Fleet Enrollment Handling for Containerized Agent #6618

Merged
pkoutsovasilis merged 1 commit into 8.x from mergify/bp/8.x/pr-6568
Jan 28, 2025


@mergify mergify bot commented Jan 28, 2025

What does this PR do?

This PR introduces a fix for containerized Fleet-managed Elastic Agents to handle scenarios where:

  • The Fleet URL changes.
  • The Fleet enrollment token changes.
  • The agent is unenrolled from Kibana Fleet.

It achieves this by enhancing the agent's logic in the container command to:

  • Verify enrollment conditions against stored state and only re-enroll when necessary.
  • Check with Fleet that the stored API token is still valid, so that authentication problems are detected even when the configuration hasn't changed.
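
The decision described in the two points above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the types storedState and desiredConfig, the function shouldEnroll, the two error values and the callback signatures are all hypothetical names. Skipping the token comparison when no hash is stored is an assumption, made so that an agent upgraded from an older version does not re-enroll.

```go
package main

import (
	"errors"
	"fmt"
)

// storedState is a hypothetical view of what the agent persisted at its last enrollment.
type storedState struct {
	Enrolled  bool
	FleetURL  string
	TokenHash string // hash of the enrollment token, never the token itself
}

// desiredConfig is a hypothetical view of what the container was started with.
type desiredConfig struct {
	FleetURL        string
	EnrollmentToken string
}

// Illustrative error values; the real agent has its own error types.
var (
	errUnauthorized = errors.New("fleet rejected the stored API token")
	errUnreachable  = errors.New("fleet unreachable")
)

// shouldEnroll reports whether the agent needs to (re-)enroll.
// matches compares a plaintext token with a stored hash; validate asks Fleet
// whether the stored API token is still accepted.
func shouldEnroll(stored storedState, want desiredConfig,
	matches func(token, hash string) bool, validate func() error) (bool, error) {

	if !stored.Enrolled {
		return true, nil // never enrolled
	}
	if stored.FleetURL != want.FleetURL {
		return true, nil // Fleet URL changed
	}
	// Assumption: state written by an older agent has no hash, so the token
	// comparison is skipped instead of forcing a re-enrollment.
	if stored.TokenHash != "" && !matches(want.EnrollmentToken, stored.TokenHash) {
		return true, nil // enrollment token changed
	}
	// Configuration is unchanged; the agent may still have been unenrolled in Fleet.
	err := validate()
	if errors.Is(err, errUnauthorized) {
		return true, nil
	}
	if err != nil {
		return false, err // e.g. Fleet unreachable: do not guess
	}
	return false, nil
}

func main() {
	sameToken := func(token, hash string) bool { return "hash:"+token == hash }
	tokenOK := func() error { return nil }

	stored := storedState{Enrolled: true, FleetURL: "https://fleet.example:8220", TokenHash: "hash:token-a"}
	changed := desiredConfig{FleetURL: stored.FleetURL, EnrollmentToken: "token-b"}

	enroll, err := shouldEnroll(stored, changed, sameToken, tokenOK)
	fmt.Println("re-enroll:", enroll, "error:", err)
}
```

Passing the comparison and the validation as callbacks keeps the decision itself free of I/O, which makes each scenario easy to cover in a unit test.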

The PR also adds Kubernetes integration tests that verify enrollment and re-enrollment behaviour under several scenarios, including:

  • Re-deployment of agent with an updated fleet enrollment token.
  • Re-deployment of agent that is unenrolled from Kibana Fleet with the same enrollment token.
  • Re-deployment of the agent from an older version to this one, verifying that no re-enrollment happens.

Key changes include:

  • Introduction of shouldFleetEnroll to centralize logic for fleet enrollment decisions.
  • PBKDF2-based hashing of the enrollment token; the hash is saved in the agent state so that token changes can be tracked securely.
  • Use of the agent/{id}/acks path of Fleet Server with empty events to check whether the agent API token is still valid. I am more than happy to introduce a separate path just for this purpose on Fleet Server, although it would behave the same as the acks path with empty events (as of now at least).
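
As a rough illustration of the token-hashing idea, the sketch below derives a salted PBKDF2-HMAC-SHA256 hash from an enrollment token and later checks a token against it. It is not the code from this PR: the function names, the iteration count, the salt length and the "salt$key" storage format are all assumptions, and PBKDF2 is hand-rolled from the standard library here only to keep the example self-contained (production code would normally use a vetted PBKDF2 package).

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"encoding/binary"
	"fmt"
	"strings"
)

const (
	iterations = 100_000 // illustrative value, not taken from the PR
	saltLen    = 16      // illustrative value, not taken from the PR
)

// pbkdf2SHA256 derives a single 32-byte block (block index 1) as defined by PBKDF2.
func pbkdf2SHA256(password, salt []byte, iter int) []byte {
	mac := hmac.New(sha256.New, password)
	mac.Write(salt)
	var idx [4]byte
	binary.BigEndian.PutUint32(idx[:], 1)
	mac.Write(idx[:])
	u := mac.Sum(nil) // U_1 = PRF(password, salt || INT(1))

	out := make([]byte, len(u))
	copy(out, u)
	for i := 1; i < iter; i++ {
		mac.Reset()
		mac.Write(u)
		u = mac.Sum(u[:0]) // U_{i+1} = PRF(password, U_i)
		for j := range out {
			out[j] ^= u[j]
		}
	}
	return out
}

// hashToken returns "salt$key" (both base64) for storage in the agent state.
func hashToken(token string) (string, error) {
	salt := make([]byte, saltLen)
	if _, err := rand.Read(salt); err != nil {
		return "", err
	}
	key := pbkdf2SHA256([]byte(token), salt, iterations)
	enc := base64.RawStdEncoding
	return enc.EncodeToString(salt) + "$" + enc.EncodeToString(key), nil
}

// tokenMatches re-derives the key with the stored salt and compares in constant time.
func tokenMatches(token, stored string) bool {
	saltB64, keyB64, ok := strings.Cut(stored, "$")
	if !ok {
		return false
	}
	salt, err := base64.RawStdEncoding.DecodeString(saltB64)
	if err != nil {
		return false
	}
	want, err := base64.RawStdEncoding.DecodeString(keyB64)
	if err != nil {
		return false
	}
	got := pbkdf2SHA256([]byte(token), salt, iterations)
	return subtle.ConstantTimeCompare(got, want) == 1
}

func main() {
	stored, err := hashToken("example-enrollment-token")
	if err != nil {
		panic(err)
	}
	fmt.Println("same token matches:   ", tokenMatches("example-enrollment-token", stored))
	fmt.Println("changed token matches:", tokenMatches("another-token", stored))
}
```

Storing only the salted hash means the state file reveals nothing directly usable for enrollment, while a changed token is still detectable on the next start.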

The CI run of the first commit in this PR shows that the Elastic Agent, before this PR, doesn't handle enrollment correctly and always falls back to the Fleet token and URL stored in its state. This PR addresses that issue and ensures enrollment uses the configured URL and token.

PS: the actual changes of this PR are in commit b6596d0, which is +305 -25, so I consider this PR aligned with the team policies 🙂

Why is it important?

This fix ensures robust and predictable behavior for containerized Fleet-managed Elastic Agents and enhances the user experience of managing elastic-agent in Kubernetes.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

The changes introduced in this PR are non-disruptive. They improve fleet enrollment handling and maintain backward compatibility.

How to test this PR locally

mage integration:auth
PLATFORMS=linux/arm64 EXTERNAL=true SNAPSHOT=true PACKAGES=docker mage -v package 
INSTANCE_PROVISIONER=kind STACK_PROVISIONER=stateful K8S_VERSION=v1.31.1 SNAPSHOT=true mage integration:kubernetes

Related issues


This is an automatic backport of pull request #6568 done by [Mergify](https://mergify.com).

* feat: add k8s integration test to check fleet enrollment

* fix: container correct fleet enrollment when token changes or the agent is unenrolled

* fix: update TestDiagnosticLocalConfig to include enrollment_token_hash

* feat: add a simple retry logic while validate the stored agent api token

* feat: add unit-test for shouldFleetEnroll

* fix: improve unit-test explicitness and check for expected number of calls

* fix: kind in changelog fragment

* fix: split up ack-ing fleet in a separate function

(cherry picked from commit 17814cc)
@mergify mergify bot requested a review from a team as a code owner January 28, 2025 18:47
@mergify mergify bot added the backport label Jan 28, 2025
@mergify mergify bot requested review from andrzej-stencel and removed request for a team January 28, 2025 18:47
@mergify mergify bot requested a review from pchila January 28, 2025 18:47
@github-actions github-actions bot added the bug and Team:Elastic-Agent-Control-Plane labels Jan 28, 2025
@elasticmachine

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis pkoutsovasilis merged commit 1e81d7f into 8.x Jan 28, 2025
15 checks passed
@pkoutsovasilis pkoutsovasilis deleted the mergify/bp/8.x/pr-6568 branch January 28, 2025 22:07
Labels
backport, bug, Team:Elastic-Agent-Control-Plane
2 participants