Switch k8s input paths to /var/log/pods/* to ingest rotated container logs #6583

pchila · 2025-01-23T15:19:58Z

What does this PR do?

This PR changes the input definition for ingesting k8s container logs from /var/log/containers/* to /var/log/pods/* which contains all available logs (including rotated ones) for pods running on the k8s node.

Before this change elastic-agent would configure the input using the currently running container id and locating the corresponding log file under /var/log/containers/* which would point to the current log file, ignoring previously rotated log files for the container.

This change only impacts standalone agents. To have the same behaviour for managed agents a separate PR must be opened for the kubernetes integration in https://github.com/elastic/integrations/tree/main/packages/kubernetes

Why is it important?

This allow ingesting logs also for containers that rotate logs quickly (for example containers that are crashing very early in their execution) instad of trying to ingest only the last log file for the container.

Checklist

I have read and understood the pull request guidelines of this project.
My code follows the style guidelines of this project
~~[ ] I have commented my code, particularly in hard-to-understand areas~~
~~[ ] I have made corresponding changes to the documentation~~
~~[ ] I have made corresponding change to the default configuration files~~
~~[ ] I have added tests that prove my fix is effective or that my feature works~~
I have added an entry in ./changelog/fragments using the changelog tool
I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Related issues

Closes [k8s provider] Look up metadata based on file path #5164

Questions to ask yourself

How are we going to support this in production?
How are we going to measure its adoption?
How are we going to debug this?
What are the metrics I should take care of?
...

mergify · 2025-01-23T15:20:35Z

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit

mergify · 2025-01-23T15:20:36Z

backport-v8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

cmacknz · 2025-01-23T17:57:43Z

Other references to /var/log/containers to possibly correct:

blakerouse · 2025-01-23T19:30:44Z

@pchila @pkoutsovasilis Do we need to worry about this change causing the files to be re-indexed? How filebeat stores files in the registry is not 100% clear in my mind, I would hope that it would not do that, but maybe?

cmacknz · 2025-01-23T19:53:33Z

@pchila @pkoutsovasilis Do we need to worry about this change causing the files to be re-indexed? How filebeat stores files in the registry is not 100% clear in my mind, I would hope that it would not do that, but maybe?

It shouldn't https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_file_identity_2

The default is either device ID + inode but we are switching everything to fingerprint because inode reuse is more common than we originally thought. CC @rdner in case there is some subtlety here that I'm missing.

pkoutsovasilis · 2025-01-23T21:38:42Z

It shouldn't https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_file_identity_2

The default is either device ID + inode but we are switching everything to fingerprint because inode reuse is more common than we originally thought. CC @rdner in case there is some subtlety here that I'm missing.

Just echoing the same message; AFAICT nothing should be re-indexed as fingerprint is enabled in filestream 🙂

rdner · 2025-01-24T08:32:50Z

deploy/helm/elastic-agent/templates/integrations/_kubernetes/_kubernetes_logs_containers.tpl

@@ -16,12 +16,12 @@ Config input for container logs
    namespace: {{ .Values.kubernetes.namespace }}
  use_output: {{ .Values.kubernetes.output }}
  streams:
-  - id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}
+  - id: kubernetes-container-logs-${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name}
    data_stream:
      dataset: kubernetes.container_logs
      type: logs


As long as this translates to a running filestream input and we don't switch from 8.x to 9.x the change in the PR should cause no re-indexing.

Starting with 9.0 we will have fingerprint as a default file identity which is based on the file content, not its filesystem metadata.

Does this mean that an upgrade from 8.x to 9.x will result in a re-index of all files? That would be very bad.

Does this mean that an upgrade from 8.x to 9.x will result in a re-index of all files? That would be very bad.

No, we also implemented a feature to migrate the state from the old default (native that uses inode + device ID) to the new default fingerprint. If you're curious, here is the PR.

rdner · 2025-01-24T08:37:09Z

Just echoing the same message; AFAICT nothing should be re-indexed as fingerprint is enabled in filestream 🙂

@pkoutsovasilis

There is a problem with the template you linked, fingerprint is default now but this template expects the opposite, probably needs to be changed.

pkoutsovasilis · 2025-01-24T08:49:10Z

@pkoutsovasilis

There is a problem with the template you linked, fingerprint is default now but this template expects the opposite, probably needs to be changed.

@rdner I believe you refer to {{#if useFingerprint}} part which by default is true even for the template, thus I would say that the template expects this to be enabled, what am I missing? 🙂

rdner · 2025-01-24T09:59:19Z

@pkoutsovasilis since 9.0 fingerprint is enabled by default. If the template has useFingerprint = false it will still be enabled. I don't think this is what expected there.

pchila · 2025-01-28T14:29:50Z

@rdner Re: fingerprint mechanism I performed a quick test with a container that would output a very short log before exiting and I noticed that having fingerprint enabled (with default length of 1024) would result in logs not being ingested even after the container terminated... Is this a known issue/limitation/design feature ?

Edit: forgot to add some details about the test

elastic-package stack with 9.0.0-SNAPSHOT version
mingrammer/flog deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: flog
  name: flog
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flog
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: flog
    spec:
      containers:
      - image: mingrammer/flog:latest
        name: log
        args:
          - flog
          - -b
          - "512"
          - -f
          - json
      restartPolicy: Always

running on kind version 0.26.0, k8s v1.32.0

rdner · 2025-01-28T15:20:06Z

@pchila yes it's a known and documented limitation.

It's adjustable down to 64 bytes, but a smaller size would increase the risk of a hash collision.

Now the fingerprint is enabled by default perhaps we want to reconsider the default size, what do you think @belimawr @cmacknz ?

belimawr · 2025-01-28T16:04:35Z

deploy/helm/elastic-agent/templates/integrations/_kubernetes/_kubernetes_logs_containers.tpl

@@ -16,12 +16,12 @@ Config input for container logs
    namespace: {{ .Values.kubernetes.namespace }}
  use_output: {{ .Values.kubernetes.output }}
  streams:
-  - id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}
+  - id: kubernetes-container-logs-${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name}


The new ID does not seem to be unique, this will cause problems and lead to data duplication.
Are you sure that
${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name} will always be unique and will make this block be rendered only once?

The previous implementation used the ${kubernetes.container.id} because that's guaranteed to always be unique.

Suggested change

- id: kubernetes-container-logs-${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name}

- id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}

The key thing we have to ensure here is that there will never be two Filestream inputs started with the same ID.

I see this file is in the helm chart folder, how does upgrade here work? Is there any change an user will upgrade and have the input ID changing? If so, that will lead to data duplication because the new ID will make all files be considered new, thus fully re-ingested.

@belimawr some things I have in mind, the combination of ${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name} should always be unique in a k8s cluster, right?!

also about the

how does upgrade here work

indeed when we restart shouldn't fingerprint be the same no matter the ID?! that said, if we have to do it then we should do it now that helm chart is not GA yet?! 🙂

the combination of ${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name} should always be unique in a k8s cluster, right?!

I hope so, however I'm not sure. It would be nice to back it with some K8s documentation. I know for sure ${kubernetes.container.id} is always unique, hence it's the one I've always used/recommended.

indeed when we restart shouldn't fingerprint be the same no matter the ID?!

Yes, however, the input ID is part of the state ID, hence changing the input ID will affect the final state of the file.

Within a single instance of the Filestream input, files are identified by the file_identity, in this case it's fingerpring, however the Filestream input will only look at states that match it's ID.

Here is an example of an registry (file state) entry:

{ "k": "filestream::my-filestream-id::native::8787338-65029", "v": { "ttl": 1800000000000, "updated": [ 516104569337, 1653410356 ], "cursor": { "offset": 29 }, "meta": { "source": "/home/n/go/src/github.com/elastic/beats/filebeat/test.log", "identifier_name": "native" } } }

Even if the file is the same, changing the input ID will change the key: "k": "filestream::my-filestream-id::native::8787338-65029".

That's actually one of the benefits of Filestream over Log input, because the input ID is part of the state ID, we can have different instances of the input harvesting the same file as long as the input ID is different.

Quoting from k8s docs https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/#when-to-use-multiple-namespaces

Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces.
hence, given a namespace + podname pair it's guaranteed to be unique in the cluster (as opposed to the previous id that was NOT namespaced)

Within a pod spec container names must be unique so once the pod is created the tuple (namespace, pod name, container name) is unique and the input definition does not change with container restarts.

Thanks for checking the docs and clarifying @pchila !!

Then ignore the change I suggested.

The only issue that persists is the ID changing when the helm chart is upgraded... That will cause re-ingestion of files, even with fingerprint.

belimawr · 2025-01-28T16:09:00Z

...mize/default/elastic-agent-standalone/base/elastic-agent-standalone-daemonset-configmap.yaml

@@ -354,7 +354,7 @@ data:
            condition: '${host.platform} == ''windows'''
            ignore_older: 72h
      # Input ID allowing Elastic Agent to track the state of this input. Must be unique.
-      - id: container-log-${kubernetes.pod.name}-${kubernetes.container.id}
+      - id: container-log-${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.id}


Same thing here and in the next ones about updating the ID. I see it is in the kustomize folder, however I'm not 100% sure how this file is used. Is there any chance that during an (automated) upgrade a user will go from the old ID to the new one?

This is related to the "old school" way of deploying agent in kubernetes... For the uniqueness see the previous response.

Yes it's very likely that either upgrading the agent or upgrading the kubernetes integration we may switch from one input to the other.

The files however should be the same as the /var/log/container/... are symlinks to the real log files in /var/log/pods/... so we should actually read the same file since the symlinks are dereferenced if prospector.scanner.symlinks: true is specified. Does this mean that filebeat will not keep track of the actual file metadata in the registry ?

Does this mean that filebeat will not keep track of the actual file metadata in the registry ?

TL;DR: Not in the way you expect, if you change the input ID, that input will re-ingest all files.

Longer version:
The registry for Filestream is "namespaced" by the input ID, an instance of the Filestream input will only have access to entries in the registry that match its ID. See the registry entry I mentioned on #6583 (comment). The state key is

filestream::my-filestream-id::native::8787338-65029

The format is:

filestream::<input ID>::<file identity name>::<state ID>

On the example above:

Input ID: my-filestream-id

File identity name: native

State ID: 8787338-65029.

If any of those change, it's a new entry in the registry and the file is re-ingested.

We can jump on a call tomorrow if you have any questions.

I just tested switching from the new to the old input id using the standalone k8s manifest and I observed a reingestion of container logs because of the change of input id.

Changing datastream id does not look like it triggers any re-ingestion.

This means that people upgrading from agents installed using plain k8s manifests to installing agent with the helm chart will experience the same reingestion since the input id is different there as well 😞

To be clear this will happen only once during the switch and only for the logs that are present on the node at that moment.

@nimarezainia @cmacknz @jlind23 what is your opinion on reingestion due to filestream input id change? Is it acceptable to reingest "once" when the input id changes (when we switch from the old input definition to the new or from the plan k8s manifest to the helm chart) ?

Why make this change then? Why not just keep it the same? ${kubernetes.container.id} is always unique.

Adding on my previous response: if the input id changes and the path matches all the available logs for the container in the pod (including the rotated ones) we will have log duplication at every container restart.

If we don't target all the logs for the container in the pod under /var/log/pods/* we are gonna miss rotated logs and not implement a solution for #5164

Missing logs is worse than one time duplication of logs IMO. We changed the ID before in the past as it was not unique and there wasn't much backlash.

If we are convinced this is 100% more correct it is probably worth doing. If we are really worried about duplication only make the change in 9.0 where the file identity is changing anyway.

The way most hosted k8s providers do log rotation, we are looking at re-ingesting 10 MB of logs per node at worst. This is because the log rotation is 5 10 MB files but the oldest 4 are gzipped compressed and we can't read them. We can only read the file actively being written to.

I understand, thanks for the explination @pchila.

Is it possible to tell filebeat that it used to have an old ID so it can use it as a basis and then migrate to the new ID? I don't know if that is possible or if its hard to add. Maybe having something like:

registry: previous_id: container-log-${kubernetes.pod.name}-${kubernetes.container.id}

Just a thought to provide the best experience for our customers. Doing this in 9.0 only is probably safest, but I think we would need to place a note in the release notes that upgrade from 8.x to 9.x would result in a re-ingestion of all container logs. I think that is just something that is crazy to accept for an upgrade path.... but it is a path...

Blake I both thought of the previous_id idea at about the same time (#6583 (comment)), I think we should definitely explore this.

I created elastic/beats#42472 to track and discuss this idea.

Just for context, when we change the file identity in 9.0 there will be no data re-ingestion because we can migrate the state from native (the current default) to fingerprint (the new default), that was all implemented by elastic/beats#41762.

Regarding the previous_id idea, I'll comment on the issue itself.

belimawr · 2025-01-28T16:20:37Z

It shouldn't https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_file_identity_2
The default is either device ID + inode but we are switching everything to fingerprint because inode reuse is more common than we originally thought. CC @rdner in case there is some subtlety here that I'm missing.

Just echoing the same message; AFAICT nothing should be re-indexed as fingerprint is enabled in filestream 🙂

While that is correct, it is still incomplete. The input ID is part of the state ID, so changing the input ID (I mean the Filestream input ID, not the Elastic-Agent unit/input ID) will cause all files to be re-ingested regardless of the file identity used.

There is no way to migrate state from one input instance to another.

@blakerouse's concern is very valid and real here

@pchila @pkoutsovasilis Do we need to worry about this change causing the files to be re-indexed?

If we change the IDs, like we're doing in deploy/helm/elastic-agent/templates/integrations/_kubernetes/_kubernetes_logs_containers.tpl files will be re-ingested.

The question for me is more about how (if any) an update happens. My understanding of helm charts is that they can be updated, however if we're only talking about static examples in our documentation (like our example manifest linked in our docs), then there is no issue because users won't be downloading/updating their manifests every new release. Well, that's what I'd expect users to do.

… logs

elastic-sonarqube · 2025-01-29T14:31:33Z

Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

cmacknz · 2025-01-29T14:47:16Z

There is no way to migrate state from one input instance to another.

We could build one if we really needed it. We could add an old_id field with the old pattern and then we could have the input pick up from there and write updates with the new one or something. This isn't the first time we've needed to change the input ID in the k8s integration so maybe it's worth thinking about this.

cmacknz · 2025-01-29T17:05:41Z

Created elastic/beats#42472 to track the idea of just telling the filestream input what its old ID was.

pchila added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Jan 23, 2025

pchila self-assigned this Jan 23, 2025

mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 23, 2025

rdner reviewed Jan 24, 2025

View reviewed changes

pchila mentioned this pull request Jan 28, 2025

Switch container logs input to /var/log/pod/* path elastic/integrations#12500

Draft

5 tasks

belimawr reviewed Jan 28, 2025

View reviewed changes

pchila added 2 commits January 29, 2025 14:34

Switch k8s input paths to /var/log/pods/* to ingest rotated container…

bc1945c

… logs

Fix k8s container logs input id using container names

7e4b93d

pchila force-pushed the ingest-rotated-container-logs branch from d3b1872 to 7e4b93d Compare January 29, 2025 13:34

cmacknz mentioned this pull request Jan 29, 2025

Allow changing filestream IDs without duplication by providing the previous ID values elastic/beats#42472

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch k8s input paths to /var/log/pods/* to ingest rotated container logs #6583

Switch k8s input paths to /var/log/pods/* to ingest rotated container logs #6583

pchila commented Jan 23, 2025

mergify bot commented Jan 23, 2025

mergify bot commented Jan 23, 2025

cmacknz commented Jan 23, 2025

blakerouse commented Jan 23, 2025

cmacknz commented Jan 23, 2025

pkoutsovasilis commented Jan 23, 2025

rdner Jan 24, 2025

blakerouse Jan 24, 2025

belimawr Jan 28, 2025

rdner commented Jan 24, 2025

pkoutsovasilis commented Jan 24, 2025 •

edited

Loading

rdner commented Jan 24, 2025

pchila commented Jan 28, 2025 •

edited

Loading

rdner commented Jan 28, 2025

belimawr Jan 28, 2025

belimawr Jan 28, 2025

pkoutsovasilis Jan 28, 2025 •

edited

Loading

belimawr Jan 28, 2025 •

edited

Loading

pchila Jan 28, 2025

belimawr Jan 29, 2025

belimawr Jan 28, 2025

pchila Jan 28, 2025 •

edited

Loading

belimawr Jan 29, 2025

pchila Jan 29, 2025

blakerouse Jan 29, 2025

pchila Jan 29, 2025

cmacknz Jan 29, 2025 •

edited

Loading

blakerouse Jan 29, 2025

cmacknz Jan 29, 2025

belimawr Jan 29, 2025

belimawr commented Jan 28, 2025

elastic-sonarqube bot commented Jan 29, 2025

cmacknz commented Jan 29, 2025

cmacknz commented Jan 29, 2025

	- id: kubernetes-container-logs-${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name}
	- id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}

Switch k8s input paths to /var/log/pods/* to ingest rotated container logs #6583

Are you sure you want to change the base?

Switch k8s input paths to /var/log/pods/* to ingest rotated container logs #6583

Conversation

pchila commented Jan 23, 2025

What does this PR do?

Why is it important?

Checklist

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

mergify bot commented Jan 23, 2025

mergify bot commented Jan 23, 2025

cmacknz commented Jan 23, 2025

blakerouse commented Jan 23, 2025

cmacknz commented Jan 23, 2025

pkoutsovasilis commented Jan 23, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rdner commented Jan 24, 2025

pkoutsovasilis commented Jan 24, 2025 • edited Loading

rdner commented Jan 24, 2025

pchila commented Jan 28, 2025 • edited Loading

rdner commented Jan 28, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pkoutsovasilis Jan 28, 2025 • edited Loading

Choose a reason for hiding this comment

belimawr Jan 28, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pchila Jan 28, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cmacknz Jan 29, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

belimawr commented Jan 28, 2025

elastic-sonarqube bot commented Jan 29, 2025

Quality Gate passed

cmacknz commented Jan 29, 2025

cmacknz commented Jan 29, 2025

pkoutsovasilis commented Jan 24, 2025 •

edited

Loading

pchila commented Jan 28, 2025 •

edited

Loading

pkoutsovasilis Jan 28, 2025 •

edited

Loading

belimawr Jan 28, 2025 •

edited

Loading

pchila Jan 28, 2025 •

edited

Loading

cmacknz Jan 29, 2025 •

edited

Loading