-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-14739 cart: Add SWIM stats for delay/glitches #13587
Conversation
Bug-tracker data: |
71dd2ab
to
dcb0388
Compare
dcb0388
to
25f538d
Compare
Test stage Functional on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13587/3/testReport/ |
Apologies for the churn on this. I should have waited until it had passed tests before requesting review. I think that there are some telemetry ftests that need to be updated. |
25f538d
to
fad2ee1
Compare
7e9017f
to
4ec6d18
Compare
Functional on EL 9 Test Results (old)133 tests 129 ✅ 1h 14m 44s ⏱️ Results for commit 8b138e6. ♻️ This comment has been updated with latest results. |
Functional on EL 8.8 Test Results (old)133 tests 129 ✅ 1h 12m 7s ⏱️ Results for commit 8b138e6. ♻️ This comment has been updated with latest results. |
4ec6d18
to
cce2084
Compare
Functional Hardware Medium Test Results (old)131 tests 105 ✅ 2h 8m 7s ⏱️ Results for commit 2c03b7e. ♻️ This comment has been updated with latest results. |
Functional Hardware Large Test Results (old)64 tests 63 ✅ 40m 3s ⏱️ For more details on these errors, see this check. Results for commit cce2084. |
Functional Hardware Medium Verbs Provider Test Results (old)55 tests 53 ✅ 4h 36m 0s ⏱️ For more details on these errors, see this check. Results for commit 2c03b7e. ♻️ This comment has been updated with latest results. |
@@ -1053,12 +1074,15 @@ static int64_t crt_swim_progress_cb(crt_context_t crt_ctx, int64_t timeout_us, v | |||
swim_net_glitch_update(csm->csm_ctx, self_id, delay); | |||
csm->csm_last_unpack_hlc = hlc2; | |||
} | |||
crt_metrics_sample_delay(crt_ctx, delay, delay > max_delay); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that delay
does not mean "network delay". It roughly indicates the duration from the most recent incoming request among all local crt
contexts to the current time. Is that what you are after?
For something roughly related to "network delay", as seen by the swim
context, subjected to clock offsets, may be rcv_delay
in crt_swim_srv_cb
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that what you are after?
To be honest, I don't have a specific goal here other than to expose more of the inner workings of DAOS via telemetry. As a first step, I have been looking for anomalous log entries (e.g. the above "Network outage detected") that are often seen when things aren't going well, and trying to add telemetry that would allow an admin to correlate a spike on a graph with an increase in those log entries. The idea is to try to discern patterns in telemetry across a cluster rather than having to sift through logs.
To that end, if you or @frostedcmos have suggestions for more/better telemetry I could add here, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] trying to add telemetry that would allow an admin to correlate a spike on a graph with an increase in those log entries.
Then I think the delay
here is likely what you're after. Perhaps some name better than "network delay"? To throw a stone, say, "SWIM inactivity"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To throw a stone, say, "SWIM inactivity"?
Hmm. At least for this first pass of adding telemetry, I am a little wary of trying to encode too much interpretation that results in the names of metrics diverging from the names in the code. That having been said, now that I step back and look at it a little, I agree that it would probably be best to specify that this metric is SWIM-specific, rather than a more generic measurement of networking performance. I'll update the name.
Ah yes. Nice catch. I guess you noticed that the previously discussed/required GHA update for distro versions has landed. How useful did you find the GHA run results? Please do feel free to file tickets on usability or any other issues with the GHA results. One usability issue that has been on my mind is to create an artifact of the test results (the typical daos logs, Avocado logs, etc.) for (just) all failed tests to avoid having to download (possibly multiple) large artifacts just to get at failed test logs. |
The results were much easier to look at than what I remember from the last time I tried. Not sure what changed. Being able to see the failed test name in the GH UI makes it simple to search JIRA for likely culprits. And in the case of a legitimate failure, I was able to see the output showing me which test I needed to fix.
Yes, that would be very useful. In the case of the failed hw tests (which were known failures), the raw output is very difficult to inspect because it's all squashed into a single line of text. |
Change-Id: Ifb1099fefde2ba58c7ef14e3b178cd65ceefa097
Run-GHA: true Change-Id: Ia56bc0afaad80adb3a6b85b3256d6f3d11c30ec4 Required-githooks: true Signed-off-by: Michael MacDonald <[email protected]>
ba8c1c8
to
8b138e6
Compare
That's good news.
Also good news.
Do you have an example, to see if I can improve that somehow? Is it good enough that you are likely to continue to use it? Have you shared your findings with any colleagues (thinking @jolivier23 here for example) to see if they might find it useful also? |
Yep, from the hw-large test results above:
If it continues to work, sure! I have mentioned to Jeff that it seems to be working better now; not sure that he's tried it out yet, though. I think having access to the test artifacts will be necessary to diagnose all but the most trivial of failures, though. |
Ahhh yes. I had not seen a test that timed out yet. That is some unfortunately non-pretty-printed JSON. I could attempt to massage the JUnit result in such a case, but I wonder if a better case can be made for Avocado to actually pretty-print that so that everyone benefits from it.
That's great to hear.
Which artifacts are missing? |
Change-Id: I020f2c79bcab29defddbd945400ddb6d4707b6e7
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-13587/11/testReport/ |
Change-Id: Ice13a5fbb693acdcedf6ce316baf8eb3457f3c75 Required-githooks: true
Find a different way to handle this scenario. Use-GHA: true Change-Id: I21764333bb9e9834a63e4c53c990307da457be6f Required-githooks: true Change-Id: I8eda1ce5a7c446303e74d69f9b3705ed02a0a1db Signed-off-by: Michael MacDonald <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ftest code changes LGTM
@frostedcmos Nudge... Sorry for the churn on this, just need another +1 and this can land, finally. Thanks. |
Adds a gauge to measure SWIM delay and a counter for glitches (temporary network outages). Change-Id: Ibd85c08ab3e3a38931d795d62270f3e4059d7c67 Required-githooks: true Signed-off-by: Michael MacDonald <[email protected]>
Adds a gauge to measure SWIM delay and a counter for glitches (temporary network outages). Change-Id: Ibd85c08ab3e3a38931d795d62270f3e4059d7c67 Required-githooks: true Change-Id: I854937dd249ad9f7211a3b7d40d3365a3e2f79f2 Signed-off-by: Michael MacDonald <[email protected]>
Adds a gauge to measure SWIM delay and a counter for glitches (temporary network outages). Change-Id: Ibd85c08ab3e3a38931d795d62270f3e4059d7c67 Required-githooks: true Change-Id: I854937dd249ad9f7211a3b7d40d3365a3e2f79f2 Signed-off-by: Michael MacDonald <[email protected]>
Adds a gauge to measure network delay and a counter
for glitches (temporary network outages).
Adds a small fix to GHA runner so that it doesn't exit due to
an undefined variable error.
Change-Id: I285e8806f9650eed10b1027f7c4c78755dfe7263
Required-githooks: true
Signed-off-by: Michael MacDonald [email protected]