Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch k8s input paths to /var/log/pods/* to ingest rotated container logs #6583

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

pchila
Copy link
Member

@pchila pchila commented Jan 23, 2025

What does this PR do?

This PR changes the input definition for ingesting k8s container logs from /var/log/containers/* to /var/log/pods/* which contains all available logs (including rotated ones) for pods running on the k8s node.

Before this change elastic-agent would configure the input using the currently running container id and locating the corresponding log file under /var/log/containers/* which would point to the current log file, ignoring previously rotated log files for the container.

This change only impacts standalone agents. To have the same behaviour for managed agents a separate PR must be opened for the kubernetes integration in https://github.com/elastic/integrations/tree/main/packages/kubernetes

Why is it important?

This allow ingesting logs also for containers that rotate logs quickly (for example containers that are crashing very early in their execution) instad of trying to ingest only the last log file for the container.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • [ ] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding change to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Jan 23, 2025
@pchila pchila self-assigned this Jan 23, 2025
Copy link
Contributor

mergify bot commented Jan 23, 2025

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit

Copy link
Contributor

mergify bot commented Jan 23, 2025

backport-v8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 23, 2025
@cmacknz
Copy link
Member

cmacknz commented Jan 23, 2025

@blakerouse
Copy link
Contributor

@pchila @pkoutsovasilis Do we need to worry about this change causing the files to be re-indexed? How filebeat stores files in the registry is not 100% clear in my mind, I would hope that it would not do that, but maybe?

@cmacknz
Copy link
Member

cmacknz commented Jan 23, 2025

@pchila @pkoutsovasilis Do we need to worry about this change causing the files to be re-indexed? How filebeat stores files in the registry is not 100% clear in my mind, I would hope that it would not do that, but maybe?

It shouldn't https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_file_identity_2

The default is either device ID + inode but we are switching everything to fingerprint because inode reuse is more common than we originally thought. CC @rdner in case there is some subtlety here that I'm missing.

@pkoutsovasilis
Copy link
Contributor

It shouldn't https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_file_identity_2

The default is either device ID + inode but we are switching everything to fingerprint because inode reuse is more common than we originally thought. CC @rdner in case there is some subtlety here that I'm missing.

Just echoing the same message; AFAICT nothing should be re-indexed as fingerprint is enabled in filestream 🙂

@@ -16,12 +16,12 @@ Config input for container logs
namespace: {{ .Values.kubernetes.namespace }}
use_output: {{ .Values.kubernetes.output }}
streams:
- id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}
- id: kubernetes-container-logs-${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name}
data_stream:
dataset: kubernetes.container_logs
type: logs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as this translates to a running filestream input and we don't switch from 8.x to 9.x the change in the PR should cause no re-indexing.

Starting with 9.0 we will have fingerprint as a default file identity which is based on the file content, not its filesystem metadata.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that an upgrade from 8.x to 9.x will result in a re-index of all files? That would be very bad.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that an upgrade from 8.x to 9.x will result in a re-index of all files? That would be very bad.

No, we also implemented a feature to migrate the state from the old default (native that uses inode + device ID) to the new default fingerprint. If you're curious, here is the PR.

@rdner
Copy link
Member

rdner commented Jan 24, 2025

Just echoing the same message; AFAICT nothing should be re-indexed as fingerprint is enabled in filestream 🙂

@pkoutsovasilis

There is a problem with the template you linked, fingerprint is default now but this template expects the opposite, probably needs to be changed.

@pkoutsovasilis
Copy link
Contributor

pkoutsovasilis commented Jan 24, 2025

@pkoutsovasilis

There is a problem with the template you linked, fingerprint is default now but this template expects the opposite, probably needs to be changed.

@rdner I believe you refer to {{#if useFingerprint}} part which by default is true even for the template, thus I would say that the template expects this to be enabled, what am I missing? 🙂

@rdner
Copy link
Member

rdner commented Jan 24, 2025

@pkoutsovasilis since 9.0 fingerprint is enabled by default. If the template has useFingerprint = false it will still be enabled. I don't think this is what expected there.

@pchila
Copy link
Member Author

pchila commented Jan 28, 2025

@rdner Re: fingerprint mechanism I performed a quick test with a container that would output a very short log before exiting and I noticed that having fingerprint enabled (with default length of 1024) would result in logs not being ingested even after the container terminated... Is this a known issue/limitation/design feature ?

Edit: forgot to add some details about the test

  • elastic-package stack with 9.0.0-SNAPSHOT version
  • mingrammer/flog deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: flog
  name: flog
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flog
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: flog
    spec:
      containers:
      - image: mingrammer/flog:latest
        name: log
        args:
          - flog
          - -b
          - "512"
          - -f
          - json
      restartPolicy: Always
  • running on kind version 0.26.0, k8s v1.32.0

@rdner
Copy link
Member

rdner commented Jan 28, 2025

@pchila yes it's a known and documented limitation.

It's adjustable down to 64 bytes, but a smaller size would increase the risk of a hash collision.

Now the fingerprint is enabled by default perhaps we want to reconsider the default size, what do you think @belimawr @cmacknz ?

@@ -16,12 +16,12 @@ Config input for container logs
namespace: {{ .Values.kubernetes.namespace }}
use_output: {{ .Values.kubernetes.output }}
streams:
- id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}
- id: kubernetes-container-logs-${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new ID does not seem to be unique, this will cause problems and lead to data duplication.
Are you sure that
${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name} will always be unique and will make this block be rendered only once?

The previous implementation used the ${kubernetes.container.id} because that's guaranteed to always be unique.

Suggested change
- id: kubernetes-container-logs-${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name}
- id: kubernetes-container-logs-${kubernetes.pod.name}-${kubernetes.container.id}

The key thing we have to ensure here is that there will never be two Filestream inputs started with the same ID.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this file is in the helm chart folder, how does upgrade here work? Is there any change an user will upgrade and have the input ID changing? If so, that will lead to data duplication because the new ID will make all files be considered new, thus fully re-ingested.

Copy link
Contributor

@pkoutsovasilis pkoutsovasilis Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@belimawr some things I have in mind, the combination of ${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name} should always be unique in a k8s cluster, right?!

also about the

how does upgrade here work

indeed when we restart shouldn't fingerprint be the same no matter the ID?! that said, if we have to do it then we should do it now that helm chart is not GA yet?! 🙂

Copy link
Contributor

@belimawr belimawr Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the combination of ${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.name} should always be unique in a k8s cluster, right?!

I hope so, however I'm not sure. It would be nice to back it with some K8s documentation. I know for sure ${kubernetes.container.id} is always unique, hence it's the one I've always used/recommended.

indeed when we restart shouldn't fingerprint be the same no matter the ID?!

Yes, however, the input ID is part of the state ID, hence changing the input ID will affect the final state of the file.

Within a single instance of the Filestream input, files are identified by the file_identity, in this case it's fingerpring, however the Filestream input will only look at states that match it's ID.

Here is an example of an registry (file state) entry:

{
  "k": "filestream::my-filestream-id::native::8787338-65029",
  "v": {
    "ttl": 1800000000000,
    "updated": [
      516104569337,
      1653410356
    ],
    "cursor": {
      "offset": 29
    },
    "meta": {
      "source": "/home/n/go/src/github.com/elastic/beats/filebeat/test.log",
      "identifier_name": "native"
    }
  }
}

Even if the file is the same, changing the input ID will change the key: "k": "filestream::my-filestream-id::native::8787338-65029".

That's actually one of the benefits of Filestream over Log input, because the input ID is part of the state ID, we can have different instances of the input harvesting the same file as long as the input ID is different.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quoting from k8s docs https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/#when-to-use-multiple-namespaces

Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces.
hence, given a namespace + podname pair it's guaranteed to be unique in the cluster (as opposed to the previous id that was NOT namespaced)

Within a pod spec container names must be unique so once the pod is created the tuple (namespace, pod name, container name) is unique and the input definition does not change with container restarts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking the docs and clarifying @pchila !!

Then ignore the change I suggested.

The only issue that persists is the ID changing when the helm chart is upgraded... That will cause re-ingestion of files, even with fingerprint.

@@ -354,7 +354,7 @@ data:
condition: '${host.platform} == ''windows'''
ignore_older: 72h
# Input ID allowing Elastic Agent to track the state of this input. Must be unique.
- id: container-log-${kubernetes.pod.name}-${kubernetes.container.id}
- id: container-log-${kubernetes.namespace}-${kubernetes.pod.name}-${kubernetes.container.id}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here and in the next ones about updating the ID. I see it is in the kustomize folder, however I'm not 100% sure how this file is used. Is there any chance that during an (automated) upgrade a user will go from the old ID to the new one?

Copy link
Member Author

@pchila pchila Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to the "old school" way of deploying agent in kubernetes... For the uniqueness see the previous response.

Yes it's very likely that either upgrading the agent or upgrading the kubernetes integration we may switch from one input to the other.

The files however should be the same as the /var/log/container/... are symlinks to the real log files in /var/log/pods/... so we should actually read the same file since the symlinks are dereferenced if prospector.scanner.symlinks: true is specified. Does this mean that filebeat will not keep track of the actual file metadata in the registry ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that filebeat will not keep track of the actual file metadata in the registry ?

TL;DR: Not in the way you expect, if you change the input ID, that input will re-ingest all files.

Longer version:
The registry for Filestream is "namespaced" by the input ID, an instance of the Filestream input will only have access to entries in the registry that match its ID. See the registry entry I mentioned on #6583 (comment). The state key is

filestream::my-filestream-id::native::8787338-65029

The format is:

filestream::<input ID>::<file identity name>::<state ID>

On the example above:

  • Input ID: my-filestream-id
  • File identity name: native
  • State ID: 8787338-65029.

If any of those change, it's a new entry in the registry and the file is re-ingested.

We can jump on a call tomorrow if you have any questions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested switching from the new to the old input id using the standalone k8s manifest and I observed a reingestion of container logs because of the change of input id.

Changing datastream id does not look like it triggers any re-ingestion.

This means that people upgrading from agents installed using plain k8s manifests to installing agent with the helm chart will experience the same reingestion since the input id is different there as well 😞

To be clear this will happen only once during the switch and only for the logs that are present on the node at that moment.

@nimarezainia @cmacknz @jlind23 what is your opinion on reingestion due to filestream input id change? Is it acceptable to reingest "once" when the input id changes (when we switch from the old input definition to the new or from the plan k8s manifest to the helm chart) ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why make this change then? Why not just keep it the same? ${kubernetes.container.id} is always unique.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding on my previous response: if the input id changes and the path matches all the available logs for the container in the pod (including the rotated ones) we will have log duplication at every container restart.

If we don't target all the logs for the container in the pod under /var/log/pods/* we are gonna miss rotated logs and not implement a solution for #5164

Copy link
Member

@cmacknz cmacknz Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing logs is worse than one time duplication of logs IMO. We changed the ID before in the past as it was not unique and there wasn't much backlash.

If we are convinced this is 100% more correct it is probably worth doing. If we are really worried about duplication only make the change in 9.0 where the file identity is changing anyway.

The way most hosted k8s providers do log rotation, we are looking at re-ingesting 10 MB of logs per node at worst. This is because the log rotation is 5 10 MB files but the oldest 4 are gzipped compressed and we can't read them. We can only read the file actively being written to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, thanks for the explination @pchila.

Is it possible to tell filebeat that it used to have an old ID so it can use it as a basis and then migrate to the new ID? I don't know if that is possible or if its hard to add. Maybe having something like:

registry:
  previous_id: container-log-${kubernetes.pod.name}-${kubernetes.container.id}

Just a thought to provide the best experience for our customers. Doing this in 9.0 only is probably safest, but I think we would need to place a note in the release notes that upgrade from 8.x to 9.x would result in a re-ingestion of all container logs. I think that is just something that is crazy to accept for an upgrade path.... but it is a path...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blake I both thought of the previous_id idea at about the same time (#6583 (comment)), I think we should definitely explore this.

I created elastic/beats#42472 to track and discuss this idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for context, when we change the file identity in 9.0 there will be no data re-ingestion because we can migrate the state from native (the current default) to fingerprint (the new default), that was all implemented by elastic/beats#41762.

Regarding the previous_id idea, I'll comment on the issue itself.

@belimawr
Copy link
Contributor

It shouldn't https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_file_identity_2
The default is either device ID + inode but we are switching everything to fingerprint because inode reuse is more common than we originally thought. CC @rdner in case there is some subtlety here that I'm missing.

Just echoing the same message; AFAICT nothing should be re-indexed as fingerprint is enabled in filestream 🙂

While that is correct, it is still incomplete. The input ID is part of the state ID, so changing the input ID (I mean the Filestream input ID, not the Elastic-Agent unit/input ID) will cause all files to be re-ingested regardless of the file identity used.

There is no way to migrate state from one input instance to another.

@blakerouse's concern is very valid and real here

@pchila @pkoutsovasilis Do we need to worry about this change causing the files to be re-indexed?

If we change the IDs, like we're doing in deploy/helm/elastic-agent/templates/integrations/_kubernetes/_kubernetes_logs_containers.tpl files will be re-ingested.

The question for me is more about how (if any) an update happens. My understanding of helm charts is that they can be updated, however if we're only talking about static examples in our documentation (like our example manifest linked in our docs), then there is no issue because users won't be downloading/updating their manifests every new release. Well, that's what I'd expect users to do.

@pchila pchila force-pushed the ingest-rotated-container-logs branch from d3b1872 to 7e4b93d Compare January 29, 2025 13:34
Copy link

Quality Gate passed Quality Gate passed

Issues
0 New issues
0 Fixed issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarQube

@cmacknz
Copy link
Member

cmacknz commented Jan 29, 2025

There is no way to migrate state from one input instance to another.

We could build one if we really needed it. We could add an old_id field with the old pattern and then we could have the input pick up from there and write updates with the new one or something. This isn't the first time we've needed to change the input ID in the k8s integration so maybe it's worth thinking about this.

@cmacknz
Copy link
Member

cmacknz commented Jan 29, 2025

Created elastic/beats#42472 to track the idea of just telling the filestream input what its old ID was.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[k8s provider] Look up metadata based on file path
6 participants