-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
errors: narrow error types accepted by HistoryListener
#1159
Conversation
See the following report for details: cargo semver-checks output
|
6a52f4d
to
71539ec
Compare
71539ec
to
d7c8c96
Compare
v2:
|
d7c8c96
to
4311591
Compare
Rebased on latest #1157 (with modules refactor) |
bba3986
to
4088051
Compare
4088051
to
eeef129
Compare
v2.1: added a regression test for the scenario where the driver would unnecessarily wait for |
eeef129
to
0f15e44
Compare
Rebased on main |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll review whole PR soon, but I'm posting this comment separately because it may be important. Correct me if I'm wrong, but commit "spec_exec: flatten return type of run_request_speculative_fiber" seems to misunderstand speculative execution logic and introduce a bug because of that.
As you noted, the old logic ignores the empty plan error. You changed it to return immediately, which I think is not correct.
Consider the following scenario:
- We start some speculative fibers, some of them are still executing (e.g. sent a request and waiting for an answer). They exhausted the plan (note that the plan is shared between fibers!)
- Timer goes off, we start another fiber
- This new fiber gets an empty plan, so returns immediately
- We should ignore this error, because not all existing fibers finished yet. You new logic would return immediately.
There is one more issue: let's say that in the above scenario the "empty plan" fiber was actually the last fiber (and so we should return an error now, because the execution definitely failed). The error we should return is not the "empty plan" error from this fiber, but the previous error. This is because empty plan error is just a symptom of all other executions failing.
This is what this code did in the old logic:
return last_error.unwrap_or({
Err(EMPTY_PLAN_ERROR)
});
This caused us to return EMPTY_PLAN_ERROR only if last_error was None. If we finished all executions (async_tasks.is_empty() && retries_remaining == 0
), and the error is None, that means no attempt returned an error, so there was no attempt at all, so the plan was empty. This is the only scenario where we want to return the empty plan error.
/// Selected node's connection pool is in invalid state. | ||
#[error("No connections in the pool: {0}")] | ||
ConnectionPoolError(#[from] ConnectionPoolError), | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ Huh, I wonder why this is here. Intuition tells me that we can do a retry (using ofc a retry policy, so potentially on another node) in such situation, in which case this would fall under the "RequestAttemptError". Why is this variant separated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great question. I wanted to put it under RequestAttemptError
, but then I saw the comment: https://github.com/muzarski/scylla-rust-driver/blob/main/scylla/src/client/session.rs#L2033-L2035. It says that we should not log such error in metrics. By metrics, I understand the request execution history. If I introduced the corresponding variant in RequestAttemptError
, we once again would be passing a "too broad" error type to HistoryListener::log_attempt_error
- variant corresponding to ConnectionPoolError
would never be constructed.
scylla/src/errors.rs
Outdated
/// Failed to run a request within a provided client timeout. | ||
#[error( | ||
"Request execution exceeded a client timeout {}ms", | ||
std::time::Duration::as_millis(.0) | ||
)] | ||
RequestTimeout(std::time::Duration), | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💭 "errors: add RequestTimeout variant to RequestError " adds a new variant to RequestError, but this variant is not constructed. I assume that future commits / PRs will construct it - for example by changing some function to return RequestError instead of QueryError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is constructed in the next commit here 4bae738#diff-8dcf6c908c350e5e0977344268a9ba9095466e548365d416b436c938a83e49d2R1980-R1982 (I hope that this link works as expected)
scylla/src/observability/history.rs
Outdated
history_collector.log_attempt_start(query_id, None, node1_addr()); | ||
history_collector.log_attempt_error( | ||
attempt_id, | ||
&QueryError::TimeoutError, | ||
&unexpected_response(CqlResponseKind::Ready), | ||
&RetryDecision::RetrySameNode(Some(Consistency::Quorum)), | ||
); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ Why did you make the changes in this test? They seem to change the test quite significantly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a perfect example of why QueryError
was "too broad" for log_attempt_error
. QueryError::TimeoutError could never be passed to this method, because it's an error constructed in the higher layer of execution (https://github.com/muzarski/scylla-rust-driver/blob/main/scylla/src/client/session.rs#L1974-L1987) which is done after we log all attempt errors. Since we narrowed the error type passed to log_attempt_error
, I needed to adjust the test accordingly (there is no TimeoutError
variant in RequestAttemptError
).
| - Attempt #0 sent to 127.0.0.1:19042 | ||
| request send time: 2022-02-22 20:22:22 UTC | ||
| Error at 2022-02-22 20:22:22 UTC | ||
| Error: Timeout Error | ||
| Error: Received unexpected response from the server: READY. Expected RESULT or ERROR response. | ||
| Retry decision: RetrySameNode(Some(Quorum)) | ||
| | ||
| - Attempt #1 sent to 127.0.0.1:19042 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓ Here too, why this change?
You are 100% correct, this is a bug. Thanks for spotting it. After giving it a thought, I think that I should revert this change - i.e. I also wonder if we should handle scenarios such as presented in the regression test. In other words, not to unnecessarily wait for a retry if the following conditions are true:
Do you think that it makes sense to prevent from such scenarios? Or is it such rare scenario that it's not worth to introduce additional noise to the function (new variable, additional branching etc.) |
IIUC this would mostly help in a scenario where the plan was already exhausted, but there are still some speculative execution left to start ("retries_remaining > 0"). Maybe you have an idea on how to add this functionality while also rewriting this logic to be less dangerous? |
Exactly.
I 100% agree.
I can think about it, but this is probably out of scope of this PR. I'll open an issue and I believe this could even be addressed in post-1.0.0 driver. |
Random thought: I wonder (not for the first time) if Rust makes us developers more "lazy" because of how many problems it eliminates automatically. |
One more thing: could you provide a very short overview in the description on request-error related types that exist after the PR? |
8727ce3
to
e7a700d
Compare
Introduced "new" error type and adjusted session.rs, speculative_execution module and iterator module to this type. This error represents a definite request failure (after potential retries).
e7a700d
to
0ef191c
Compare
Introduced: - RequestTimeout(std::time::Duration) - for requests that timed out with provided client timeout - SchemaAgreementTimeout(std::time::Duration) - for schema agreement timeouts
RequestError will be passed to HistoryListener when the request either fails or times out.
- `log_attempt_error` will now accept an error representing a single request failure - namely `RequestAttemptError` - `log_query_error` will now accept RequestError, which represents a definite request failure. This is a superset of RequestAttemptError, as it also contains information about potential timeout, empty plan etc.
0ef191c
to
5e6c26b
Compare
v3:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine now. I'm not fully convinced about the division between QueryError (btw will it be renamed?), RequestError and RequestAttemptError. But that is something that may be better addressed by discussion at the office some day.
Yes, I'd like to rename it. We discussed it with @wprzytula and figured out that |
Ref: #519
New error types and variants
RequestError (reintroduced)
Error type returned by
run_request_speculative_fiber
. It represents either a definite request failure - last attempt error, connection pool error, or an empty plan error.Replaced QueryError::RequestTimeout
This error variant was not good for 2 reasons:
I decoupled this variant into two separate variants that provide more context.
RequestError::RequestTimeout
After refactoring the timeout errors in
QueryError
, I added the corresponding variant toRequestError
(one introduced in this PR) as well.HistoryListener
Narrowed error types passed to log_query_error and log_attempt_error.
log_query_error
(or ratherlog_request_error
after rename) now acceptsRequestError
.log_attempt_error
now acceptsRequestAttemptError
(representing an error of a single attempt).Replaced "query" mentions with "request"
I did the replacement in trait methods, docstrings, tests etc. I tried to break the changes into multiple commits.
Execution errors summary
Currently, in session layer we can distinguish 3 kind of errors:
RequestAttemptError
- a failure of a single attempt of request execution. It is used in public API inRetrySession::decide_should_retry
andHistoryListener::log_attempt_error
RequestError
- a definite failure of request execution, after all (speculative) retries etc. In addition toRequestAttemptError
it contains information about the empty plan and request timeout. It also containsConnectionPoolError
variant. This variant is not included inRequestAttemptError
, because we don't want to log such error in execution history - there is a comment suggesting that (is it still true, though?).RequestError
is used in public API inHistoryListener::log_request_error
.QueryError
- a top-level error returned bySession
methods. It holds some additional information - e.g. about the failure of request preparation - values serialization etc.Pre-review checklist
[ ] I added relevant tests for new features and bug fixes.[ ] I have adjusted the documentation in./docs/source/
.Fixes:
annotations to PR description.