add metric for kernel restarts #1241
labels: type = kernel name, source = "restarter" or "user"
nbclassic's shimming appears to do some weird stuff, causing this same module to get imported twice under different names, with the canonical name attempting to import the deprecated name. I will have to think about the best way to resolve that, since I think the current `try: import` / `except: define` pattern doesn't make sense if nbclassic is up-to-date and notebook is not present.
I agree that |
I do like 'source' better, as that looks like something we can be more 'sure' of: we know the restarter restarted it, or the user restarted it. I feel 'trigger' implies more causality than exists.
…notebook which happens when `notebook` is entirely not present, and results in the same module being imported twice, ensuring all metrics are always defined twice
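A minimal sketch of the failure mode described above, using a toy registry rather than the real prometheus client: `ToyRegistry`, `MODULE_SOURCE`, `import_as`, and the module names are all hypothetical, and the error text is illustrative only. It shows why running the same metric-defining module body twice under two names fails once the registry enforces unique metric names.

```python
class ToyRegistry:
    """Toy stand-in for a metrics registry that rejects duplicate names."""

    def __init__(self):
        self._names = set()

    def register(self, name):
        if name in self._names:
            raise ValueError(f"Duplicated timeseries: {name}")
        self._names.add(name)
        return name


REGISTRY = ToyRegistry()

# Hypothetical module body that defines a metric at import time.
MODULE_SOURCE = 'RESTARTS = REGISTRY.register("kernel_restarts")'


def import_as(module_name):
    """Run the module body in a fresh namespace, as a second import
    of the same file under a different name would."""
    namespace = {"__name__": module_name, "REGISTRY": REGISTRY}
    exec(MODULE_SOURCE, namespace)
    return namespace


import_as("canonical.metrics")  # first import registers the metric
try:
    import_as("deprecated.metrics")  # same body, different module name
    second_import_error = None
except ValueError as err:
    second_import_error = str(err)

print(second_import_error)
```

The second execution fails because the registry is shared across both module names, while each module namespace starts empty and so tries to define the metric again.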
I like it, except it's complicated by the notebook->jupyter server migration due to prometheus enforcing uniqueness at import time, so we can't change the notebook metric. I think it would probably be good to put a |
YESSS let's prefix it all. We do that for jupyterhub too. Maybe we leave the current ones as they are, just replicate them anew with new metrics with the prefix, and get rid of the old ones in a major point release?
…for the kernel name field
Codecov Report
Patch coverage:
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1241      +/-   ##
==========================================
+ Coverage   79.11%   80.33%   +1.22%
==========================================
  Files          68       68
  Lines        8263     8274      +11
  Branches     1600     1602       +2
==========================================
+ Hits         6537     6647     +110
+ Misses       1304     1203     -101
- Partials      422      424       +2

... and 4 files with indirect coverage changes
Yeah, I think that makes sense. For now, I just updated the new metric to follow that pattern, and I can do a separate PR to deprecate the old metrics.
Should it be jupyter_ or jupyter_server_?
I literally changed it 3 times while writing it. I don't know!
Thanks for this PR @minrk. I think a "standard" Regarding "Should the metric include the kernel ID?", I think all metrics should include a "subject" indicator when applicable (and not considered PII) and, in this case, kernel_id is extremely applicable. I could definitely see cluster admins needing to correlate the "source = restarter" restart metric to a particular kernel (and therefore user) when they see resources being depleted on various nodes because the kernel's restart (due to OOM) is bouncing around the cluster.
@kevin-bates thanks! For metrics, prometheus recommends avoiding labels with too much churn, because each unique value for a label can be costly. I think this might be a case where monitoring (prometheus) metrics, and something like a Jupyter server Event might have different levels of detail (I think we should indeed have both for restart):
Kernel IDs, being UUIDs and unique for every kernel across time, seem to fit this description (a kubernetes pod name also fits this, but while it's technically unbounded, it usually grows relatively slowly on the order of prometheus data retention). So my inclination is to exclude the kernel id for now.
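To make the cardinality concern concrete, here is a rough sketch that counts how many distinct label combinations (and so how many separate time series) a restart counter would produce with and without a `kernel_id` label. The event data, the kernel names, and the `series_count` helper are made up for illustration; only the label names come from this PR.

```python
import uuid


def series_count(events, label_names):
    """Count distinct label combinations, i.e. the number of time
    series a labelled counter would end up tracking."""
    return len({tuple(event[name] for name in label_names) for event in events})


# Hypothetical restart events: 1000 kernels, each restarted once.
events = [
    {
        "kernel_name": "python3" if i % 2 else "ir",
        "source": "restarter" if i % 3 else "user",
        "kernel_id": str(uuid.uuid4()),
    }
    for i in range(1000)
]

bounded = series_count(events, ["kernel_name", "source"])
unbounded = series_count(events, ["kernel_name", "source", "kernel_id"])

print(f"without kernel_id: {bounded} series")
print(f"with kernel_id:    {unbounded} series")
```

Without `kernel_id` the series count is capped by the number of kernel names times the two sources; with it, every new kernel adds a series that never receives another sample once that kernel is gone.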
Identifying the server is definitely useful. Metrics from each instance are still stored separately, but I think not identified by default. I believe prometheus scrape config can add the server as a label via e.g. |
I'm also on the fence for whether to use |
IMO, I've somewhat strong feelings about |
I didn't realize prometheus had this "pattern/footprint" behavior and was going to ask about events as well. Given prometheus' recommendation I would agree that not including the kernel id makes sense - although much less helpful from a diagnostics perspective. The corresponding event will need to include the kernel_id. Since the prometheus namespace is flat using
It makes sense to use the package name as the prefix although I'm not ever going to propose a prefix of |
labels: kernel_name = kernel name, source = "restarter" or "user"
I don't love `type` as a label, but it's already used in running kernels, so it's probably more important that it match other metrics than it be the more logical `name` or `kernel_name`, used elsewhere in the API. Same goes for `source` to distinguish API ('user') restarts from auto restarts by the KernelRestarter ('restarter'). I also considered `trigger=user|crash` (or 'exit' instead of crash, which is slightly more precise).

closes #1240