Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[filebeat] Elasticsearch state storage for httpjson and cel inputs #41446

Open
wants to merge 32 commits into
base: main
Choose a base branch
from

Conversation

aleksmaus
Copy link
Member

@aleksmaus aleksmaus commented Oct 24, 2024

Proposed commit message

[filebeat] Elasticsearch state storage for httpjson input

This is a POC for Elasticsearch as State Store Backend for Security Integrations for Agentless solution.

The scope of this change was narrowed down to supporting only httpjson inputs in order to support Okta integration for the initial release. All the other integrations inputs still use the file storage as before.
This is a short term solution for the state storage for k8s environment.

This is the first cut and the details can change depending on the feedback.

Current feature currently could be enabled AGENTLESS_ELASTICSEARCH_STATE_STORE_ENABLED, to be decided how this would be configurable in k8s.

This change currently contains the hacky approach to the AGENTLESS_ELASTICSEARCH_APIKEY overwrite. This allows to the user to provide the ApiKey with elevated permissions that are required in order to be able to create/write/read the state index per input. THIS IS FOR DEVELOPMENT/TESTING ONLY. REMOVE BEFORE THE MERGE.

The existing code relied on the inputs state storage to be fully configurable before the main beat managers runs. The change delays the configuration of httpjson input to the time when the actual configuration is received from the Agent.

There is an assumption that the index template for the state storage indices is already in place before the storage is used

PUT _index_template/agentless_state_template
{
  "index_patterns": [
    "agentless-state-*"
  ],
  "priority": 300,
  "template": {
    "mappings": {
      "properties": {
        "v": {
          "type": "object",
          "enabled": false
        },
        "updated_at": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        }
      }
    },
    "settings": {
      "number_of_shards": 1
    }
  }
}

Example of the state storage index content for Okta integration:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "agentless-state-httpjson-okta.system-028ecf4b-babe-44c6-939e-9e3096af6959",
        "_id": "httpjson::httpjson-okta.system-028ecf4b-babe-44c6-939e-9e3096af6959::https://dev-36006609.okta.com/api/v1/logs",
        "_seq_no": 39,
        "_primary_term": 1,
        "_score": 1,
        "_source": {
          "v": {
            "ttl": 1800000000000,
            "updated": "2024-10-24T20:21:22.032Z",
            "cursor": {
              "published": "2024-10-24T20:19:53.542Z"
            }
          }
        }
      }
    ]
  }
}

The naming convention for all state store is agentless-state-<input id>, since the expectation for agentless we would have only one agent per policy and the agents are ephemeral.

Currently in order to run the agent with Elasticsearch state storage a couple of environment variables would be required:

sudo AGENTLESS_ELASTICSEARCH_STATE_STORE_ENABLED=1 AGENTLESS_ELASTICSEARCH_APIKEY=xxxxxxxx-xvpDXfB:jVMRsW7SRIxxxxxxxxx ./elastic-agent -e

where the ApiKey in the

DEPENDENCIES / TODOS:

  • Approval of teams for this approach
  • Kibana (?) side change is required for the agentless-state index template boostrapping
  • Kibana or the intergration package (or both) change is required in order to include the permissions for agentless-state- with the Elasticsearch ApiKey (Remove the hack). I suspect that Kibana fleet code could be modified to recognize agentless supporting integration and include the proper index name for the agentless-state for the ApiKey permissions.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

The change should have no impact, and without the feature enabled the filebeat should work as before using the file system storage for the state.

@aleksmaus aleksmaus self-assigned this Oct 24, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Oct 24, 2024
@aleksmaus aleksmaus added the Team:Security-Deployment and Devices Deployment and Devices Team in Security Solution label Oct 24, 2024
Copy link
Contributor

mergify bot commented Oct 24, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @aleksmaus? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

Copy link
Contributor

mergify bot commented Oct 24, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Oct 24, 2024
@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Oct 24, 2024
@aleksmaus aleksmaus changed the title [filebeat] Elasticsearch state storage for httpjson input [filebeat] Elasticsearch state storage for httpjson and cel inputs Oct 30, 2024
@aleksmaus
Copy link
Member Author

@belimawr @cmacknz (or whoever wants/have time to be involved)
I need your feedback on this draft, if this approach is something that we could eventually merge (the ApiKey workaround will be removed once we adjust the kibana fleet).
I think this is an ok solution given the circumstances:

  1. This is fully backwards compatible. If the feature is not enabled, everything would work as before.
  2. The only inputs that are enabled for Elasticsearch backed state storage are the httpjson and cel. We only enabling the limited number of integration relying on httpjson or cel inputs for the first release.
  3. The state initialization for the inputs is delayed until we get the configuration only if the feature is enabled for the input.
  4. The agent logs monitoring will still use the local storage, since we agreed that loosing the agent log when the pod relocated is acceptable.

@cmacknz
Copy link
Member

cmacknz commented Nov 1, 2024

@leehinman I'd appreciate a review here to make sure this can co-exist with Beats receivers in agent since that would be the long term way we plan to run agentless inputs.

@cmacknz cmacknz requested a review from leehinman November 1, 2024 20:23

// List of input types Elasticsearch state store is enabled for
var esTypesEnabled = map[string]void{
"httpjson": {},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be configuration instead of in the code, maybe another env var?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure can do. Something like this?
AGENTLESS_ELASTICSEARCH_STATE_STORE_INPUT_TYPES=httpjson,cel

Comment on lines +37 to +45
func New(ctx context.Context, log *logp.Logger, notifier *Notifier) (*Registry, error) {
r := &Registry{
ctx: ctx,
log: log,
notifier: notifier,
}

return r, nil
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error return can be removed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will leave for now as is, as it's consistent with memlog.New that returns the registry and in case if further changes that could return error from the constructor

…asticsearch state store is enabled if any of the inputs are with AGENTLESS_ELASTICSEARCH_STATE_STORE_INPUT_TYPES.
@aleksmaus
Copy link
Member Author

Eliminated AGENTLESS_ELASTICSEARCH_STATE_STORE_ENABLED environment variable setting, it seemed a bit redundant.
Now Elasticsearch state storage is enabled if AGENTLESS_ELASTICSEARCH_STATE_STORE_INPUT_TYPES is set.
Her is an example of shorter command to run the agent now:

sudo AGENTLESS_ELASTICSEARCH_STATE_STORE_INPUT_TYPES="httpjson,cel" AGENTLESS_ELASTICSEARCH_APIKEY=fsOitZIBVlcA-mvxxxxx:jVMRsW7SRIOc-U6VHxxxxx ./elastic-agent -e

It will be event shorter when the APIKEY permissions are properly set through the policy and we can remove that workaround with APIKEY injection.

@aleksmaus
Copy link
Member Author

Can I get some reviews on this PR? What is acceptable? What is missing?

Assuming that this "hack" https://github.com/elastic/beats/pull/41446/files#diff-97b65fad069be1072219d7ae6a1e8f64d287e8cb6c1f4e424c801a22200db104R318
will be removed in the final version, when we have the proper permissions changes on Kibana side.

Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest thing missing now is tests. There needs to be an integration test proving this actually works, ideally against ES running in a container.

I would also like to see some tests that drive the Beat configuration as if it were managed by Elastic agent. There are a couple that start Filebeat with management enabled but using a gRPC server that is under the test control instead of the real agent, they also run with ES running already. https://github.com/cmacknz/beats/blob/d4fc5cdeff6d2bc382f0341fe9118157f6fcb849/x-pack/filebeat/tests/integration/managerV2_test.go#L56-L70


type OnConfigUpdateFunc func(c *conf.C)

type Notifier struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added tests awhile back. @cmacknz could you please take a look again?

func (n *Notifier) Unsubscribe(id int) {
n.mx.Lock()
defer n.mx.Unlock()
delete(n.listeners, id)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this decrement n.counter?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

counter here is just the id (handle) that is returned to the user on Subscribe in order to allow to that "handle" later for Unsubscribe.

}
s.cliErr = nil

cli, err := eslegclient.NewConnectedClient(ctx, c, s.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean there is a client and connection per input?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per instance of the storage, yes. Do we want to make it singleton/global?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mainly because of this piece:

// NewConnectedClient returns a non-thread-safe connection. Make sure for each goroutine you initialize a new connection.

// NewConnectedClient returns a non-thread-safe connection. Make sure for each goroutine you initialize a new connection.
func NewConnectedClient(ctx context.Context, cfg *cfg.C, beatname string) (*Connection, error) {


// Sets the store ID when the full input configuration is acquired
// This is needed in order to support Elasticsearch state store naming convention based on the input ID
SetID(id string)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment implies this means there will be a store per input, would this make more sense being a required argument when the store is constructed? You can't have an interface that includes a constructor, but that is where it feels like it belongs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to workaround the current code interfaces that are already in place here. That's why had to add additional function to be able to set the ID when the input configuration is actually received and the ID is available.

libbeat/statestore/backend/es/store.go Show resolved Hide resolved
"query": map[string]any{
"match_all": map[string]any{},
},
"size": 1000, // TODO: we might have to do scroll if there are more than 1000 keys
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How likely are we to have more than 1000 keys? The inputs wouldn't have had any kind of restriction on the number of keys before, so hard to tell if this won't matter for a while or is critically important to fix.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the input types that I'm testing with so far is very very unlikely. For example for httpjson input, it's usually one cursor key/value per stream.

Copy link
Contributor

mergify bot commented Dec 3, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b poc/es_state_store upstream/poc/es_state_store
git merge upstream/main
git push upstream poc/es_state_store

@aleksmaus aleksmaus requested a review from a team as a code owner December 10, 2024 20:59
Copy link
Contributor

mergify bot commented Dec 10, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b poc/es_state_store upstream/poc/es_state_store
git merge upstream/main
git push upstream poc/es_state_store

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team Team:Security-Deployment and Devices Deployment and Devices Team in Security Solution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants