[8.x](backport #6568) Fix Fleet Enrollment Handling for Containerized Agent #6618

mergify · 2025-01-28T18:47:18Z

What does this PR do?

This PR introduces a fix for containerized Fleet-managed Elastic Agents to handle scenarios where:

The Fleet URL changes.
The Fleet enrollment token changes.
The agent is unenrolled from Kibana Fleet.

It achieves the above by enhancing the agent's logic inside the container cmd to:

Verify enrollment conditions against stored state and only re-enroll when necessary.
Validate the agent's API token with Fleet, ensuring correct authentication even when the configuration hasn't changed.
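
To make the two steps above concrete, here is a rough sketch of how such a decision could be structured. It is an illustration only, not the code in this PR: `storedState`, `desiredConfig`, `needsEnroll` and the `apiTokenValid` callback are hypothetical names, and treating an empty stored hash as "nothing to compare" is an assumption made so that state written by a version that never stored a hash does not force a re-enrollment.

```go
package main

import "fmt"

// storedState is a hypothetical stand-in for what the agent persists after
// enrolling; the real state has a different shape.
type storedState struct {
	Enrolled  bool
	FleetURL  string
	TokenHash string // hash of the enrollment token used when enrolling
}

// desiredConfig is a hypothetical view of the container's current settings.
type desiredConfig struct {
	FleetURL  string
	TokenHash string // hash of the currently configured enrollment token
}

// needsEnroll reports whether the agent should (re-)enroll and why.
// apiTokenValid is an assumed callback that asks Fleet whether the stored
// agent API token is still accepted; it is only called when nothing in the
// local configuration has changed.
func needsEnroll(st storedState, cfg desiredConfig, apiTokenValid func() bool) (bool, string) {
	switch {
	case !st.Enrolled:
		return true, "not enrolled"
	case st.FleetURL != cfg.FleetURL:
		return true, "fleet url changed"
	case st.TokenHash != "" && st.TokenHash != cfg.TokenHash:
		// Assumption: an empty stored hash means "unknown", not "changed".
		return true, "enrollment token changed"
	case !apiTokenValid():
		return true, "api token rejected by fleet"
	default:
		return false, "state matches configuration"
	}
}

func main() {
	st := storedState{Enrolled: true, FleetURL: "https://fleet.example:8220", TokenHash: "abc"}
	cfg := desiredConfig{FleetURL: "https://fleet.example:8220", TokenHash: "def"}
	enroll, reason := needsEnroll(st, cfg, func() bool { return true })
	fmt.Println(enroll, reason) // prints "true enrollment token changed"
}
```

Ordering the cheap local comparisons before the remote check keeps the call to Fleet off the path whenever the configuration alone already answers the question.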

The PR also adds Kubernetes integration tests to verify the proper behaviour of enrollment and re-enrollment under various scenarios, including:

Re-deployment of agent with an updated fleet enrollment token.
Re-deployment of agent that is unenrolled from Kibana Fleet with the same enrollment token.
Re-deployment of agent from an older version to this one, verifying that no re-enrollment happens.

Key changes include:

Introduction of shouldFleetEnroll to centralize logic for fleet enrollment decisions.
Creation of PBKDF2-based enrollment token hashing, so that the hash can be saved in the agent state and token changes can be tracked securely.
Use of the agent/{id}/acks path of Fleet Server with empty events to check whether the agent API token is still valid. More than happy to introduce a separate path just for this purpose on Fleet Server, although it would behave the same as the acks one with empty events (as of now at least).
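
As an illustration of the token-hashing idea listed above, the sketch below stores a random salt and a PBKDF2 hash instead of the enrollment token, then checks a candidate token against that record. Everything specific here is an assumption rather than what this PR ships: the `tokenRecord` shape, the function names, HMAC-SHA256 as the PRF, the iteration count, and the salt and key lengths. PBKDF2 is written out against the standard library only to keep the example dependency-free; production code would more likely call a maintained PBKDF2 package.

```go
package main

import (
	"crypto/hmac"
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// Illustrative parameters; the PR description does not state the real ones.
const (
	iterations = 10000
	saltLen    = 16
	keyLen     = 32
)

// tokenRecord is a hypothetical shape for what gets persisted: the salt and
// the derived hash (both hex-encoded), never the enrollment token itself.
type tokenRecord struct {
	Salt string
	Hash string
}

// pbkdf2SHA256 derives n bytes from password and salt as described in
// RFC 8018, using HMAC-SHA256 as the PRF.
func pbkdf2SHA256(password, salt []byte, iter, n int) []byte {
	prf := hmac.New(sha256.New, password)
	hashLen := prf.Size()
	numBlocks := (n + hashLen - 1) / hashLen
	out := make([]byte, 0, numBlocks*hashLen)
	var idx [4]byte
	for block := 1; block <= numBlocks; block++ {
		// U_1 = PRF(password, salt || big-endian block index)
		prf.Reset()
		prf.Write(salt)
		binary.BigEndian.PutUint32(idx[:], uint32(block))
		prf.Write(idx[:])
		u := prf.Sum(nil)
		t := append([]byte(nil), u...)
		// U_k = PRF(password, U_{k-1}); T = U_1 xor U_2 xor ... xor U_iter
		for k := 2; k <= iter; k++ {
			prf.Reset()
			prf.Write(u)
			u = prf.Sum(nil)
			for i := range t {
				t[i] ^= u[i]
			}
		}
		out = append(out, t...)
	}
	return out[:n]
}

// hashEnrollmentToken derives a record for token using a fresh random salt.
func hashEnrollmentToken(token string) (tokenRecord, error) {
	salt := make([]byte, saltLen)
	if _, err := rand.Read(salt); err != nil {
		return tokenRecord{}, err
	}
	sum := pbkdf2SHA256([]byte(token), salt, iterations, keyLen)
	return tokenRecord{Salt: hex.EncodeToString(salt), Hash: hex.EncodeToString(sum)}, nil
}

// tokenMatches re-derives the hash with the stored salt and compares it in
// constant time; a malformed record never matches.
func tokenMatches(rec tokenRecord, token string) bool {
	salt, err := hex.DecodeString(rec.Salt)
	if err != nil {
		return false
	}
	want, err := hex.DecodeString(rec.Hash)
	if err != nil {
		return false
	}
	got := pbkdf2SHA256([]byte(token), salt, iterations, keyLen)
	return subtle.ConstantTimeCompare(got, want) == 1
}

func main() {
	rec, err := hashEnrollmentToken("old-token")
	if err != nil {
		panic(err)
	}
	fmt.Println(tokenMatches(rec, "old-token"), tokenMatches(rec, "new-token"))
}
```

Because the salt is random per record, two hashes of the same token differ, so a change is detected by re-deriving with the stored salt rather than by comparing two stored hashes.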

You can easily see here in the CI run of the first commit in this PR that the Elastic Agent, before this PR, doesn't handle enrollment correctly and always resorts to using the Fleet token and URL stored in its state. This PR addresses that issue and ensures enrollment uses the correct configuration and token.

PS: the actual changes of this PR are in commit b6596d0, which is +305 -25, so I consider this PR aligned with the team policies 🙂

Why is it important?

This fix ensures robust and predictable behavior for containerized Fleet-managed Elastic Agents and enhances the user experience of managing elastic-agent in Kubernetes.

Checklist

I have read and understood the pull request guidelines of this project.
My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding change to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in ./changelog/fragments using the changelog tool
I have added an integration test or an E2E test

Disruptive User Impact

The changes introduced in this PR are non-disruptive. They improve fleet enrollment handling and maintain backward compatibility.

How to test this PR locally

mage integration:auth
PLATFORMS=linux/arm64 EXTERNAL=true SNAPSHOT=true PACKAGES=docker mage -v package 
INSTANCE_PROVISIONER=kind STACK_PROVISIONER=stateful K8S_VERSION=v1.31.1 SNAPSHOT=true mage integration:kubernetes

Related issues

Closes Elastic Agent doesn't update the enrollment token in Kubernetes Deployment statefile #3586

This is an automatic backport of pull request #6568 done by [Mergify](https://mergify.com).

* feat: add k8s integration test to check fleet enrollment
* fix: container correct fleet enrollment when token changes or the agent is unenrolled
* fix: update TestDiagnosticLocalConfig to include enrollment_token_hash
* feat: add a simple retry logic while validate the stored agent api token
* feat: add unit-test for shouldFleetEnroll
* fix: improve unit-test explicitness and check for expected number of calls
* fix: kind in changelog fragment
* fix: split up ack-ing fleet in a separate function

(cherry picked from commit 17814cc)

elasticmachine · 2025-01-28T18:47:34Z

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

elastic-sonarqube · 2025-01-28T19:42:46Z

Quality Gate passed

Issues
2 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
73.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube

mergify bot requested a review from a team as a code owner January 28, 2025 18:47

mergify bot added the backport label Jan 28, 2025

mergify bot requested review from andrzej-stencel and removed request for a team January 28, 2025 18:47

mergify bot assigned pkoutsovasilis Jan 28, 2025

mergify bot requested a review from pchila January 28, 2025 18:47

github-actions bot added the bug and Team:Elastic-Agent-Control-Plane labels Jan 28, 2025

pkoutsovasilis approved these changes Jan 28, 2025

pkoutsovasilis merged commit 1e81d7f into 8.x Jan 28, 2025
15 checks passed

pkoutsovasilis deleted the mergify/bp/8.x/pr-6568 branch January 28, 2025 22:07
