Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-14739 cart: Add SWIM stats for delay/glitches #13587

Merged
merged 6 commits into from
Feb 9, 2024
Merged

Conversation

mjmac
Copy link
Contributor

@mjmac mjmac commented Jan 10, 2024

Adds a gauge to measure network delay and a counter
for glitches (temporary network outages).

Adds a small fix to GHA runner so that it doesn't exit due to
an undefined variable error.

Change-Id: I285e8806f9650eed10b1027f7c4c78755dfe7263
Required-githooks: true
Signed-off-by: Michael MacDonald [email protected]

@mjmac mjmac requested a review from a team as a code owner January 10, 2024 15:35
Copy link

Bug-tracker data:
Ticket title is 'Expanded pool/container metrics'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-14739

@mjmac mjmac requested review from liw, chowes and frostedcmos January 10, 2024 15:39
@mjmac mjmac marked this pull request as draft January 10, 2024 15:46
frostedcmos
frostedcmos previously approved these changes Jan 10, 2024
@mjmac mjmac marked this pull request as ready for review January 10, 2024 16:38
@mjmac mjmac requested review from liw and chowes January 10, 2024 16:39
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13587/3/testReport/

@mjmac mjmac removed request for liw and chowes January 11, 2024 15:43
@mjmac
Copy link
Contributor Author

mjmac commented Jan 11, 2024

Apologies for the churn on this. I should have waited until it had passed tests before requesting review. I think that there are some telemetry ftests that need to be updated.

@mjmac mjmac marked this pull request as draft January 11, 2024 15:44
@mjmac mjmac force-pushed the mjmac/DAOS-14739 branch 2 times, most recently from 7e9017f to 4ec6d18 Compare January 17, 2024 20:30
Copy link

github-actions bot commented Jan 17, 2024

Functional on EL 9 Test Results (old)

133 tests   129 ✅  1h 14m 44s ⏱️
 40 suites    4 💤
 40 files      0 ❌

Results for commit 8b138e6.

♻️ This comment has been updated with latest results.

Copy link

github-actions bot commented Jan 17, 2024

Functional on EL 8.8 Test Results (old)

133 tests   129 ✅  1h 12m 7s ⏱️
 40 suites    4 💤
 40 files      0 ❌

Results for commit 8b138e6.

♻️ This comment has been updated with latest results.

Copy link

github-actions bot commented Jan 18, 2024

Functional Hardware Medium Test Results (old)

131 tests   105 ✅  2h 8m 7s ⏱️
 34 suites   26 💤
 34 files      0 ❌

Results for commit 2c03b7e.

♻️ This comment has been updated with latest results.

Copy link

Functional Hardware Large Test Results (old)

64 tests   63 ✅  40m 3s ⏱️
14 suites   0 💤
14 files     0 ❌  1 🔥

For more details on these errors, see this check.

Results for commit cce2084.

Copy link

github-actions bot commented Jan 18, 2024

Functional Hardware Medium Verbs Provider Test Results (old)

55 tests   53 ✅  4h 36m 0s ⏱️
 7 suites   1 💤
 7 files     0 ❌  1 🔥

For more details on these errors, see this check.

Results for commit 2c03b7e.

♻️ This comment has been updated with latest results.

@@ -1053,12 +1074,15 @@ static int64_t crt_swim_progress_cb(crt_context_t crt_ctx, int64_t timeout_us, v
swim_net_glitch_update(csm->csm_ctx, self_id, delay);
csm->csm_last_unpack_hlc = hlc2;
}
crt_metrics_sample_delay(crt_ctx, delay, delay > max_delay);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that delay does not mean "network delay". It roughly indicates the duration from the most recent incoming request among all local crt contexts to the current time. Is that what you are after?

For something roughly related to "network delay", as seen by the swim context, subjected to clock offsets, may be rcv_delay in crt_swim_srv_cb.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that what you are after?

To be honest, I don't have a specific goal here other than to expose more of the inner workings of DAOS via telemetry. As a first step, I have been looking for anomalous log entries (e.g. the above "Network outage detected") that are often seen when things aren't going well, and trying to add telemetry that would allow an admin to correlate a spike on a graph with an increase in those log entries. The idea is to try to discern patterns in telemetry across a cluster rather than having to sift through logs.

To that end, if you or @frostedcmos have suggestions for more/better telemetry I could add here, please let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[...] trying to add telemetry that would allow an admin to correlate a spike on a graph with an increase in those log entries.

Then I think the delay here is likely what you're after. Perhaps some name better than "network delay"? To throw a stone, say, "SWIM inactivity"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To throw a stone, say, "SWIM inactivity"?

Hmm. At least for this first pass of adding telemetry, I am a little wary of trying to encode too much interpretation that results in the names of metrics diverging from the names in the code. That having been said, now that I step back and look at it a little, I agree that it would probably be best to specify that this metric is SWIM-specific, rather than a more generic measurement of networking performance. I'll update the name.

@brianjmurrell
Copy link
Contributor

@brianjmurrell: NB, there's a fix in this commit that allows the GHA testing to work.

Ah yes. Nice catch.

I guess you noticed that the previously discussed/required GHA update for distro versions has landed.

How useful did you find the GHA run results? Please do feel free to file tickets on usability or any other issues with the GHA results.

One usability issue that has been on my mind is to create an artifact of the test results (the typical daos logs, Avocado logs, etc.) for (just) all failed tests to avoid having to download (possibly multiple) large artifacts just to get at failed test logs.

@mjmac
Copy link
Contributor Author

mjmac commented Jan 23, 2024

How useful did you find the GHA run results? Please do feel free to file tickets on usability or any other issues with the GHA results.

The results were much easier to look at than what I remember from the last time I tried. Not sure what changed. Being able to see the failed test name in the GH UI makes it simple to search JIRA for likely culprits. And in the case of a legitimate failure, I was able to see the output showing me which test I needed to fix.

One usability issue that has been on my mind is to create an artifact of the test results (the typical daos logs, Avocado logs, etc.) for (just) all failed tests to avoid having to download (possibly multiple) large artifacts just to get at failed test logs.

Yes, that would be very useful. In the case of the failed hw tests (which were known failures), the raw output is very difficult to inspect because it's all squashed into a single line of text.

Change-Id: Ifb1099fefde2ba58c7ef14e3b178cd65ceefa097
Run-GHA: true
Change-Id: Ia56bc0afaad80adb3a6b85b3256d6f3d11c30ec4
Required-githooks: true
Signed-off-by: Michael MacDonald <[email protected]>
frostedcmos
frostedcmos previously approved these changes Jan 23, 2024
@brianjmurrell
Copy link
Contributor

brianjmurrell commented Jan 23, 2024

The results were much easier to look at than what I remember from the last time I tried.

That's good news.

And in the case of a legitimate failure, I was able to see the output showing me which test I needed to fix.

Also good news.

In the case of the failed hw tests (which were known failures), the raw output is very difficult to inspect because it's all squashed into a single line of text.

Do you have an example, to see if I can improve that somehow?

Is it good enough that you are likely to continue to use it? Have you shared your findings with any colleagues (thinking @jolivier23 here for example) to see if they might find it useful also?

@mjmac
Copy link
Contributor Author

mjmac commented Jan 23, 2024

Do you have an example, to see if I can improve that somehow?

Yep, from the hw-large test results above:

Functional Hardware Large Test Results (old)

64 tests   63 ✅  40m 3s ⏱️ 14 suites   0 💤 14 files     0 ❌  1 🔥

For more details on these errors, see this check.

Results for commit cce2084.

Is it good enough that you are likely to continue to use it?

If it continues to work, sure! I have mentioned to Jeff that it seems to be working better now; not sure that he's tried it out yet, though. I think having access to the test artifacts will be necessary to diagnose all but the most trivial of failures, though.

@brianjmurrell
Copy link
Contributor

For more details on these errors, see this check.

Ahhh yes. I had not seen a test that timed out yet. That is some unfortunately non-pretty-printed JSON.

I could attempt to massage the JUnit result in such a case, but I wonder if a better case can be made for Avocado to actually pretty-print that so that everyone benefits from it.

If it continues to work, sure!

That's great to hear.

I think having access to the test artifacts will be necessary to diagnose all but the most trivial of failures, though.

Which artifacts are missing?

Change-Id: I020f2c79bcab29defddbd945400ddb6d4707b6e7
@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13587/11/testReport/

mjmac added 2 commits February 6, 2024 16:18
Change-Id: Ice13a5fbb693acdcedf6ce316baf8eb3457f3c75
Required-githooks: true
Find a different way to handle this scenario.

Use-GHA: true
Change-Id: I21764333bb9e9834a63e4c53c990307da457be6f
Required-githooks: true

Change-Id: I8eda1ce5a7c446303e74d69f9b3705ed02a0a1db
Signed-off-by: Michael MacDonald <[email protected]>
Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ftest code changes LGTM

@mjmac mjmac requested a review from frostedcmos February 7, 2024 19:43
@mjmac
Copy link
Contributor Author

mjmac commented Feb 8, 2024

@frostedcmos Nudge... Sorry for the churn on this, just need another +1 and this can land, finally. Thanks.

@mjmac mjmac merged commit dbc8923 into master Feb 9, 2024
48 checks passed
@mjmac mjmac deleted the mjmac/DAOS-14739 branch February 9, 2024 16:27
mjmac added a commit that referenced this pull request Feb 12, 2024
Adds a gauge to measure SWIM delay and a counter
for glitches (temporary network outages).

Change-Id: Ibd85c08ab3e3a38931d795d62270f3e4059d7c67
Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
jolivier23 pushed a commit that referenced this pull request Feb 28, 2024
Adds a gauge to measure SWIM delay and a counter
for glitches (temporary network outages).

Change-Id: Ibd85c08ab3e3a38931d795d62270f3e4059d7c67
Required-githooks: true

Change-Id: I854937dd249ad9f7211a3b7d40d3365a3e2f79f2
Signed-off-by: Michael MacDonald <[email protected]>
jolivier23 pushed a commit that referenced this pull request Apr 10, 2024
Adds a gauge to measure SWIM delay and a counter
for glitches (temporary network outages).

Change-Id: Ibd85c08ab3e3a38931d795d62270f3e4059d7c67
Required-githooks: true

Change-Id: I854937dd249ad9f7211a3b7d40d3365a3e2f79f2
Signed-off-by: Michael MacDonald <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

7 participants