docs inconsistency: Histograms doesn't include grpc_code #75
Comments
It'd be nice to hear @bwplotka's opinion as well, but I feel having the
Mixed feelings as well, as cardinality might be a problem for almost no gain. On the other hand, IMO knowing the latency of error requests is useful, but I'm not sure whether exact timing differences between 202 and 200, or 504 versus 500, matter... probably in some specific cases they do (: Also, do you need this @NBR41, or is the mismatch between docs and implementation what you want to fix? I think a PR to fix the docs is a must-have anyway.
Agreed.
No, I don't have this particular need for now.
Cool, let's correct the docs for now, thank you for finding this! Marking as a docs bug.
I think it's not unreasonable for succeeded and failed calls to behave quite differently; in other words, failed calls are often outliers latency-wise. For example, when a deadline is exceeded somewhere, failed calls will tend to be vastly slower than regular calls. In other cases, errors will be detected early (e.g. bad requests, failed preconditions), artificially skewing the distribution towards a fat head.

A more specific example: if I see that the 95th percentile duration for a call is 1s, does that mean the call is sometimes slow but serving OK, or that those slow requests are the ones that hit some deadline-exceeded error downstream? Looking at the error ratio with grpc_server_handled_total can give me a hint, but it alone cannot fully answer that, since the method could perform multiple downstream calls, some of which could have a very short deadline and thus cause the outer method to be counted in a small bucket with its grpc_code="deadline_exceeded".

EDIT: I'm not very concerned with the cardinality increase; after all, the number of possible gRPC error codes is bounded (unlike the method names, for which we already have a label). But knowing whether it's "OK" vs "other" would already be useful.
IMO it should have a label with SUCCESS/FAILURE values, at least in the histograms.
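To make the proposal above concrete, here is a minimal sketch of what a handling-time histogram carrying a success/failure label could look like if registered directly with client_golang. This is not a metric that go-grpc-prometheus exports today; the metric name, the grpc_result label, and the observe helper are made up purely for illustration.

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical sketch only: NOT part of go-grpc-prometheus.
// The metric name and grpc_result label are illustrative.
var handlingSecondsByResult = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name: "grpc_server_handling_seconds_by_result", // hypothetical name
		Help: "Histogram of gRPC call latency, split by outcome.",
	},
	[]string{"grpc_service", "grpc_method", "grpc_result"}, // grpc_result: "success" or "failure"
)

func init() {
	prometheus.MustRegister(handlingSecondsByResult)
}

// observe records one finished call; any non-nil error counts as a failure.
func observe(service, method string, start time.Time, err error) {
	result := "success"
	if err != nil {
		result = "failure"
	}
	handlingSecondsByResult.WithLabelValues(service, method, result).Observe(time.Since(start).Seconds())
}
```

Nothing in this sketch changes go-grpc-prometheus itself; it only shows the extra label dimension being discussed, which would let the latency distribution of failed calls be queried separately from successful ones.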
I'm also a bit surprised by this issue; for golden signals, it is usually recommended to analyze latency for OK and error responses separately.
The cardinality argument doesn't seem very strong given that there are only a small number of status codes. I would appreciate it if this could be fixed.
Hi,
unlike what the documentation in the README file says, it appears that grpc_code is not included as a label in the histogram.
The grpc_code label is not among the predefined labels.
Is there a way to add grpc_code, or maybe it would be nice to correct the README?
Thanks
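For context, below is a minimal server-side setup sketch, assuming the standard go-grpc-prometheus v1 API as shown in that library's README (the default interceptors, EnableHandlingTimeHistogram, and Register); the comments spell out which labels the exported series actually carry, i.e. the mismatch reported here.

```go
package main

import (
	"net"
	"net/http"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// gRPC server wired with the default go-grpc-prometheus server interceptors.
	server := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)

	// Register your gRPC services on `server` here.

	// The handling-time histogram is opt-in.
	grpc_prometheus.EnableHandlingTimeHistogram()

	// Pre-initialize metrics for all registered services and methods.
	grpc_prometheus.Register(server)

	// As currently implemented, the exported series look roughly like:
	//   grpc_server_handled_total{grpc_type, grpc_service, grpc_method, grpc_code}    <- has grpc_code
	//   grpc_server_handling_seconds_bucket{grpc_type, grpc_service, grpc_method, le} <- no grpc_code
	// which is the README mismatch this issue describes.

	go func() {
		_ = http.ListenAndServe(":9092", promhttp.Handler())
	}()

	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		panic(err)
	}
	_ = server.Serve(lis)
}
```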