DAOS-14739 cart: Add SWIM stats for delay/glitches #13587

mjmac · 2024-01-10T15:35:12Z

Adds a gauge to measure network delay and a counter
for glitches (temporary network outages).

Adds a small fix to GHA runner so that it doesn't exit due to
an undefined variable error.

Change-Id: I285e8806f9650eed10b1027f7c4c78755dfe7263
Required-githooks: true
Signed-off-by: Michael MacDonald [email protected]

github-actions · 2024-01-10T15:35:52Z

Bug-tracker data:
Ticket title is 'Expanded pool/container metrics'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-14739

daosbuild1 · 2024-01-10T18:50:53Z

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13587/3/testReport/

mjmac · 2024-01-11T15:43:56Z

Apologies for the churn on this. I should have waited until it had passed tests before requesting review. I think that there are some telemetry ftests that need to be updated.

github-actions · 2024-01-17T20:45:46Z

Functional on EL 9 Test Results (old)

133 tests 129 ✅ 1h 14m 44s ⏱️
40 suites 4 💤
40 files 0 ❌

Results for commit 8b138e6.

♻️ This comment has been updated with latest results.

github-actions · 2024-01-17T20:52:50Z

Functional on EL 8.8 Test Results (old)

133 tests 129 ✅ 1h 12m 7s ⏱️
40 suites 4 💤
40 files 0 ❌

Results for commit 8b138e6.

♻️ This comment has been updated with latest results.

github-actions · 2024-01-18T09:49:48Z

Functional Hardware Medium Test Results (old)

131 tests 105 ✅ 2h 8m 7s ⏱️
34 suites 26 💤
34 files 0 ❌

Results for commit 2c03b7e.

♻️ This comment has been updated with latest results.

github-actions · 2024-01-18T10:34:04Z

Functional Hardware Large Test Results (old)

64 tests 63 ✅ 40m 3s ⏱️
14 suites 0 💤
14 files 0 ❌ 1 🔥

For more details on these errors, see this check.

Results for commit cce2084.

github-actions · 2024-01-18T13:10:10Z

Functional Hardware Medium Verbs Provider Test Results (old)

55 tests 53 ✅ 4h 36m 0s ⏱️
7 suites 1 💤
7 files 0 ❌ 1 🔥

For more details on these errors, see this check.

Results for commit 2c03b7e.

♻️ This comment has been updated with latest results.

liw · 2024-01-22T07:19:50Z

src/cart/crt_swim.c

@@ -1053,12 +1074,15 @@ static int64_t crt_swim_progress_cb(crt_context_t crt_ctx, int64_t timeout_us, v
 				swim_net_glitch_update(csm->csm_ctx, self_id, delay);
 				csm->csm_last_unpack_hlc = hlc2;
 			}
+			crt_metrics_sample_delay(crt_ctx, delay, delay > max_delay);


My understanding is that delay does not mean "network delay". It roughly indicates the duration from the most recent incoming request among all local crt contexts to the current time. Is that what you are after?

For something roughly related to "network delay", as seen by the swim context, subjected to clock offsets, may be rcv_delay in crt_swim_srv_cb.

Is that what you are after?

To be honest, I don't have a specific goal here other than to expose more of the inner workings of DAOS via telemetry. As a first step, I have been looking for anomalous log entries (e.g. the above "Network outage detected") that are often seen when things aren't going well, and trying to add telemetry that would allow an admin to correlate a spike on a graph with an increase in those log entries. The idea is to try to discern patterns in telemetry across a cluster rather than having to sift through logs.

To that end, if you or @frostedcmos have suggestions for more/better telemetry I could add here, please let me know.

[...] trying to add telemetry that would allow an admin to correlate a spike on a graph with an increase in those log entries.

Then I think the delay here is likely what you're after. Perhaps some name better than "network delay"? To throw a stone, say, "SWIM inactivity"?

To throw a stone, say, "SWIM inactivity"?

Hmm. At least for this first pass of adding telemetry, I am a little wary of trying to encode too much interpretation that results in the names of metrics diverging from the names in the code. That having been said, now that I step back and look at it a little, I agree that it would probably be best to specify that this metric is SWIM-specific, rather than a more generic measurement of networking performance. I'll update the name.

brianjmurrell · 2024-01-22T16:14:35Z

@brianjmurrell: NB, there's a fix in this commit that allows the GHA testing to work.

Ah yes. Nice catch.

I guess you noticed that the previously discussed/required GHA update for distro versions has landed.

How useful did you find the GHA run results? Please do feel free to file tickets on usability or any other issues with the GHA results.

One usability issue that has been on my mind is to create an artifact of the test results (the typical daos logs, Avocado logs, etc.) for (just) all failed tests to avoid having to download (possibly multiple) large artifacts just to get at failed test logs.

mjmac · 2024-01-23T19:58:05Z

How useful did you find the GHA run results? Please do feel free to file tickets on usability or any other issues with the GHA results.

The results were much easier to look at than what I remember from the last time I tried. Not sure what changed. Being able to see the failed test name in the GH UI makes it simple to search JIRA for likely culprits. And in the case of a legitimate failure, I was able to see the output showing me which test I needed to fix.

One usability issue that has been on my mind is to create an artifact of the test results (the typical daos logs, Avocado logs, etc.) for (just) all failed tests to avoid having to download (possibly multiple) large artifacts just to get at failed test logs.

Yes, that would be very useful. In the case of the failed hw tests (which were known failures), the raw output is very difficult to inspect because it's all squashed into a single line of text.

Change-Id: Ifb1099fefde2ba58c7ef14e3b178cd65ceefa097

Run-GHA: true Change-Id: Ia56bc0afaad80adb3a6b85b3256d6f3d11c30ec4 Required-githooks: true Signed-off-by: Michael MacDonald <[email protected]>

brianjmurrell · 2024-01-23T21:01:43Z

The results were much easier to look at than what I remember from the last time I tried.

That's good news.

And in the case of a legitimate failure, I was able to see the output showing me which test I needed to fix.

Also good news.

In the case of the failed hw tests (which were known failures), the raw output is very difficult to inspect because it's all squashed into a single line of text.

Do you have an example, to see if I can improve that somehow?

Is it good enough that you are likely to continue to use it? Have you shared your findings with any colleagues (thinking @jolivier23 here for example) to see if they might find it useful also?

mjmac · 2024-01-23T21:10:39Z

Do you have an example, to see if I can improve that somehow?

Yep, from the hw-large test results above:

Functional Hardware Large Test Results (old)

64 tests 63 ✅ 40m 3s ⏱️ 14 suites 0 💤 14 files 0 ❌ 1 🔥

For more details on these errors, see this check.

Results for commit cce2084.

Is it good enough that you are likely to continue to use it?

If it continues to work, sure! I have mentioned to Jeff that it seems to be working better now; not sure that he's tried it out yet, though. I think having access to the test artifacts will be necessary to diagnose all but the most trivial of failures, though.

brianjmurrell · 2024-01-23T23:01:01Z

For more details on these errors, see this check.

Ahhh yes. I had not seen a test that timed out yet. That is some unfortunately non-pretty-printed JSON.

I could attempt to massage the JUnit result in such a case, but I wonder if a better case can be made for Avocado to actually pretty-print that so that everyone benefits from it.

If it continues to work, sure!

That's great to hear.

I think having access to the test artifacts will be necessary to diagnose all but the most trivial of failures, though.

Which artifacts are missing?

src/gurt/telemetry.c

src/cart/crt_swim.c

Change-Id: I020f2c79bcab29defddbd945400ddb6d4707b6e7

daosbuild1 · 2024-01-27T14:40:38Z

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13587/11/testReport/

Change-Id: Ice13a5fbb693acdcedf6ce316baf8eb3457f3c75 Required-githooks: true

Find a different way to handle this scenario. Use-GHA: true Change-Id: I21764333bb9e9834a63e4c53c990307da457be6f Required-githooks: true Change-Id: I8eda1ce5a7c446303e74d69f9b3705ed02a0a1db Signed-off-by: Michael MacDonald <[email protected]>

daltonbohning

ftest code changes LGTM

mjmac · 2024-02-08T18:13:26Z

@frostedcmos Nudge... Sorry for the churn on this, just need another +1 and this can land, finally. Thanks.

Adds a gauge to measure SWIM delay and a counter for glitches (temporary network outages). Change-Id: Ibd85c08ab3e3a38931d795d62270f3e4059d7c67 Required-githooks: true Signed-off-by: Michael MacDonald <[email protected]>

Adds a gauge to measure SWIM delay and a counter for glitches (temporary network outages). Change-Id: Ibd85c08ab3e3a38931d795d62270f3e4059d7c67 Required-githooks: true Change-Id: I854937dd249ad9f7211a3b7d40d3365a3e2f79f2 Signed-off-by: Michael MacDonald <[email protected]>

mjmac requested a review from a team as a code owner January 10, 2024 15:35

mjmac requested review from liw, chowes and frostedcmos January 10, 2024 15:39

mjmac force-pushed the mjmac/DAOS-14739 branch from 71dd2ab to dcb0388 Compare January 10, 2024 15:40

mjmac removed request for liw, chowes and frostedcmos January 10, 2024 15:46

mjmac marked this pull request as draft January 10, 2024 15:46

mjmac force-pushed the mjmac/DAOS-14739 branch from dcb0388 to 25f538d Compare January 10, 2024 15:51

frostedcmos previously approved these changes Jan 10, 2024

View reviewed changes

mjmac marked this pull request as ready for review January 10, 2024 16:38

mjmac requested review from liw and chowes January 10, 2024 16:39

mjmac removed request for liw and chowes January 11, 2024 15:43

mjmac marked this pull request as draft January 11, 2024 15:44

mjmac force-pushed the mjmac/DAOS-14739 branch from 25f538d to fad2ee1 Compare January 17, 2024 18:47

mjmac dismissed frostedcmos’s stale review via 7e9017f January 17, 2024 19:52

mjmac force-pushed the mjmac/DAOS-14739 branch 2 times, most recently from 7e9017f to 4ec6d18 Compare January 17, 2024 20:30

mjmac force-pushed the mjmac/DAOS-14739 branch from 4ec6d18 to cce2084 Compare January 17, 2024 21:08

liw reviewed Jan 22, 2024

View reviewed changes

Merge branch 'master' into mjmac/DAOS-14739

339ec12

Change-Id: Ifb1099fefde2ba58c7ef14e3b178cd65ceefa097

mjmac dismissed frostedcmos’s stale review via ba8c1c8 January 23, 2024 20:05

Rename net/delay to net/swim_delay

8b138e6

Run-GHA: true Change-Id: Ia56bc0afaad80adb3a6b85b3256d6f3d11c30ec4 Required-githooks: true Signed-off-by: Michael MacDonald <[email protected]>

mjmac force-pushed the mjmac/DAOS-14739 branch from ba8c1c8 to 8b138e6 Compare January 23, 2024 20:05

frostedcmos previously approved these changes Jan 23, 2024

View reviewed changes

kjacque requested changes Jan 26, 2024

View reviewed changes

src/gurt/telemetry.c Outdated Show resolved Hide resolved

src/cart/crt_swim.c Outdated Show resolved Hide resolved

Merge branch 'master' into mjmac/DAOS-14739

5e52126

Change-Id: I020f2c79bcab29defddbd945400ddb6d4707b6e7

mjmac added 2 commits February 6, 2024 16:18

Merge branch 'master' into mjmac/DAOS-14739

720cb87

Change-Id: Ice13a5fbb693acdcedf6ce316baf8eb3457f3c75 Required-githooks: true

Remove d_tm_zero_gauge()

0da0bb0

Find a different way to handle this scenario. Use-GHA: true Change-Id: I21764333bb9e9834a63e4c53c990307da457be6f Required-githooks: true Change-Id: I8eda1ce5a7c446303e74d69f9b3705ed02a0a1db Signed-off-by: Michael MacDonald <[email protected]>

mjmac dismissed frostedcmos’s stale review via 0da0bb0 February 6, 2024 16:55

mjmac requested review from a team as code owners February 6, 2024 16:55

kjacque approved these changes Feb 6, 2024

View reviewed changes

daltonbohning reviewed Feb 6, 2024

View reviewed changes

mjmac requested a review from frostedcmos February 7, 2024 19:43

frostedcmos approved these changes Feb 9, 2024

View reviewed changes

mjmac merged commit dbc8923 into master Feb 9, 2024
48 checks passed

mjmac deleted the mjmac/DAOS-14739 branch February 9, 2024 16:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DAOS-14739 cart: Add SWIM stats for delay/glitches #13587

DAOS-14739 cart: Add SWIM stats for delay/glitches #13587

mjmac commented Jan 10, 2024 •

edited

Loading

github-actions bot commented Jan 10, 2024

daosbuild1 commented Jan 10, 2024

mjmac commented Jan 11, 2024

github-actions bot commented Jan 17, 2024 •

edited

Loading

github-actions bot commented Jan 17, 2024 •

edited

Loading

github-actions bot commented Jan 18, 2024 •

edited

Loading

github-actions bot commented Jan 18, 2024

github-actions bot commented Jan 18, 2024 •

edited

Loading

liw Jan 22, 2024

mjmac Jan 22, 2024

liw Jan 23, 2024

mjmac Jan 23, 2024

brianjmurrell commented Jan 22, 2024

mjmac commented Jan 23, 2024

brianjmurrell commented Jan 23, 2024 •

edited

Loading

mjmac commented Jan 23, 2024 •

edited

Loading

Functional Hardware Large Test Results (old)

brianjmurrell commented Jan 23, 2024

daosbuild1 commented Jan 27, 2024

daltonbohning left a comment

mjmac commented Feb 8, 2024

DAOS-14739 cart: Add SWIM stats for delay/glitches #13587

DAOS-14739 cart: Add SWIM stats for delay/glitches #13587

Conversation

mjmac commented Jan 10, 2024 • edited Loading

github-actions bot commented Jan 10, 2024

daosbuild1 commented Jan 10, 2024

mjmac commented Jan 11, 2024

github-actions bot commented Jan 17, 2024 • edited Loading

Functional on EL 9 Test Results (old)

github-actions bot commented Jan 17, 2024 • edited Loading

Functional on EL 8.8 Test Results (old)

github-actions bot commented Jan 18, 2024 • edited Loading

Functional Hardware Medium Test Results (old)

github-actions bot commented Jan 18, 2024

Functional Hardware Large Test Results (old)

github-actions bot commented Jan 18, 2024 • edited Loading

Functional Hardware Medium Verbs Provider Test Results (old)

liw Jan 22, 2024

Choose a reason for hiding this comment

mjmac Jan 22, 2024

Choose a reason for hiding this comment

liw Jan 23, 2024

Choose a reason for hiding this comment

mjmac Jan 23, 2024

Choose a reason for hiding this comment

brianjmurrell commented Jan 22, 2024

mjmac commented Jan 23, 2024

brianjmurrell commented Jan 23, 2024 • edited Loading

mjmac commented Jan 23, 2024 • edited Loading

Functional Hardware Large Test Results (old)

brianjmurrell commented Jan 23, 2024

daosbuild1 commented Jan 27, 2024

daltonbohning left a comment

Choose a reason for hiding this comment

mjmac commented Feb 8, 2024

mjmac commented Jan 10, 2024 •

edited

Loading

github-actions bot commented Jan 17, 2024 •

edited

Loading

github-actions bot commented Jan 17, 2024 •

edited

Loading

github-actions bot commented Jan 18, 2024 •

edited

Loading

github-actions bot commented Jan 18, 2024 •

edited

Loading

brianjmurrell commented Jan 23, 2024 •

edited

Loading

mjmac commented Jan 23, 2024 •

edited

Loading