otel: fix flakiness and various issues in TestFBOtelRestartE2E #6819

Draft
wants to merge 13 commits into
base: main
from

mauri870
Member

@mauri870 mauri870 commented Feb 11, 2025

What does this PR do?

This test starts the collector with a timeout, but the error it returns is not
always a context cancellation. Sometimes it returns err == nil, which is also
fine but was not handled properly.

While at it, fix some other issues I found while testing:

  • Remove the requirement that an ignored field cannot be equal in both documents. There are cases for agent.version where it matches on main but not in 8.x or 9.0.
  • Using require inside a goroutine calls runtime.Goexit on failure, meaning
    that goroutine exits immediately without running the rest of its cleanup code, causing resource leaks. Use assert in those
    cases.
  • Now that the beats dependency is up to date, deduplication works as intended (see "otelconsumer: set document id attribute for elasticsearchexporter", beats#42412). Update the test to use logs_dynamic_id in the elasticsearchexporter options and ensure data is deduplicated in Elasticsearch.
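
The point about require inside a goroutine can be illustrated with a minimal, standard-library-only sketch. It assumes only that a failing require ends in t.FailNow, which calls runtime.Goexit; the function name runWorker and the cleanedUp flag are hypothetical stand-ins and are not taken from the test itself.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runWorker simulates a goroutine body in which a failing check aborts the
// goroutine before the explicit, non-deferred cleanup is reached.
// abort=true stands in for a failing require.* call.
func runWorker(abort bool) (cleanedUp bool) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done() // deferred calls still run when Goexit is called
		if abort {
			runtime.Goexit() // terminates only this goroutine
		}
		cleanedUp = true // stand-in for cleanup code placed after the check
	}()
	wg.Wait()
	return cleanedUp
}

func main() {
	fmt.Println("aborting check, cleanup ran:", runWorker(true))
	fmt.Println("non-aborting check, cleanup ran:", runWorker(false))
}
```

With an assert-style check the goroutine keeps running after a failure, so the statements that follow the check still execute.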

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Related issues

@mauri870 mauri870 added Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-8.x Automated backport to the 8.x branch with mergify backport-9.0 Automated backport to the 9.0 branch labels Feb 11, 2025
@mauri870 mauri870 self-assigned this Feb 11, 2025
@mauri870 mauri870 requested a review from a team as a code owner February 11, 2025 16:56
@mauri870 mauri870 requested review from swiatekm and pchila February 11, 2025 16:56
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@mauri870 mauri870 changed the title otel: adjust TestFBOtelRestartE2E to validate deduplication works otel: adjust TestFBOtelRestartE2E to validate that deduplication works Feb 11, 2025
@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Feb 11, 2025
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@mauri870 mauri870 marked this pull request as draft February 12, 2025 11:31
@mauri870
Member Author

mauri870 commented Feb 12, 2025

Moving this to draft since it requires work done in beats via elastic/beats#42412. I need to bump the beats dependency in go.mod (#6837).

@mauri870 mauri870 force-pushed the otel-restart-test-duplicates branch from dc95b0a to d04606e Compare February 19, 2025 20:34
@mauri870 mauri870 changed the title otel: adjust TestFBOtelRestartE2E to validate that deduplication works otel: fix flaky behavior in TestFBOtelRestartE2E Feb 19, 2025
@mauri870 mauri870 changed the title otel: fix flaky behavior in TestFBOtelRestartE2E otel: fix flakiness and various issues in TestFBOtelRestartE2E Feb 19, 2025
@mauri870 mauri870 marked this pull request as ready for review February 19, 2025 20:41
@mauri870
Member Author

mauri870 commented Feb 19, 2025

I'm repurposing this PR to include a series of fixes for the otel tests. Having the fixes as a batch as opposed to separate PRs speeds up the continuous integration builds.

@mauri870 mauri870 requested a review from swiatekm February 19, 2025 20:43
This test starts the collector with a timeout, but the error it returns is not
always a context cancellation. Sometimes it returns err == nil, which is also
fine but was not handled properly.

While at it, fix some other issues I found while testing:

- Using require inside a goroutine calls runtime.Goexit on failure, meaning
  that goroutine exits immediately without running the rest of its cleanup code. Use assert in those
  cases.
@mauri870 mauri870 force-pushed the otel-restart-test-duplicates branch from d04606e to 95c25c9 Compare February 20, 2025 11:29
@mauri870 mauri870 enabled auto-merge (squash) February 20, 2025 12:51
@mauri870 mauri870 marked this pull request as draft February 20, 2025 18:51
auto-merge was automatically disabled February 20, 2025 18:51

Pull request was converted to draft

@mauri870 mauri870 force-pushed the otel-restart-test-duplicates branch from ce3fb39 to ac211c6 Compare February 21, 2025 16:39
@mauri870 mauri870 marked this pull request as ready for review February 21, 2025 16:50
@mauri870 mauri870 enabled auto-merge (squash) February 21, 2025 16:50
@mauri870
Member Author

/test

@jlind23
Contributor

jlind23 commented Feb 26, 2025

The fleet airgapped tests are the ones failing, and those are known issues. Retriggering the tests.

@mauri870
Member Author

The fleet airgapped tests are the ones failing, and those are known issues. Retriggering the tests.

Thanks. I have yet to see a test run succeed in the last week or so. It seems that on every single retry the exact same set of fleet tests fails.

@elasticmachine
Contributor

elasticmachine commented Feb 26, 2025

⏳ Build in-progress, with failures

Failed CI Steps

History

cc @mauri870

Member

@pchila pchila left a comment

A couple of nitpicks on the assertion functions; overall it looks good.

  _, err = inputFile.Write([]byte("\n"))
- require.NoErrorf(t, err, "failed to write newline to temp file")
+ assert.NoErrorf(t, err, "failed to write newline to temp file")
Member

Suggested change
- assert.NoErrorf(t, err, "failed to write newline to temp file")
+ assert.NoError(t, err, "failed to write newline to temp file")

@@ -1519,7 +1513,7 @@ service:
  go func() {
  err = fixture.RunOtelWithClient(fCtx)
  cancel()
- require.True(t, errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled), "unexpected error: %v", err)
+ assert.True(t, err == nil || errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled), "unexpected error: %v", err)
Member

nit: since we are not testing a boolean we can use https://pkg.go.dev/github.com/stretchr/testify/assert#Conditionf with a small function where we can also do some additional logging if necessary

Comment on lines +1577 to +1578
_, found = uniqueIngestedLogs[msg]
require.False(t, found, "found duplicated log message %q", msg)
Member

Any specific reason why we can't use assert.NotContains here?

Suggested change
- _, found = uniqueIngestedLogs[msg]
- require.False(t, found, "found duplicated log message %q", msg)
+ require.NotContainsf(t, uniqueIngestedLogs, msg, "found duplicated log message %q", msg)

Member Author

Not at all, I just didn't know about this function. TIL.

@mauri870 mauri870 marked this pull request as draft February 26, 2025 14:27
auto-merge was automatically disabled February 26, 2025 14:27

Pull request was converted to draft

@mauri870
Member Author

Leaving this as draft since we had to remove all the tests as part of a last-ditch fix for EDOT in v9.0. I will revisit this once #7023 is reverted.

Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-9.0 Automated backport to the 9.0 branch skip-changelog Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flaky Test]: TestFBOtelRestartE2E – expected the collector to have stopped
8 participants