-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for concurrent requests to a single peer. #200
Conversation
15b5ed7
to
9cc6c79
Compare
Thanks @divagant-martian for the review! I wasn't too familiar with using
|
15afbcc
to
7ae82df
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
partial review for now
src/handler/active_requests.rs
Outdated
.position(|req| req.packet().message_nonce() == nonce) | ||
.expect("to find request call by nonce"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back when I suggested to use a expect
the case was first finding the position and then a get
. In that case it was ok since we assume the standard library ensures us position
is correct.
Here however, this means the RequestCall
matching the nonce
is always present, which it should if we don't have bugs. Since we are treating our (potential) bugs as None
this shouldn't be expect
d, but treated instead exactly as the Vacant
variant
src/handler/active_requests.rs
Outdated
Entry::Vacant(_) => panic!("invalid ActiveRequests state"), | ||
Entry::Occupied(mut requests) => { | ||
let index = requests | ||
.get() | ||
.iter() | ||
.position(|req| req.packet().message_nonce() == &nonce) | ||
.expect("to find request call by nonce"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to no panic in those to keep in line with gracefully handling potential inconsistencies
5947e0c
to
6bece13
Compare
I'm testing this PR with a simulation. 🙂 The sequence diagram below shows what happens on session expirations. sequenceDiagram
participant Node1
participant Node2
Note over Node1: Start discv5 server
Note over Node2: Start discv5 server
Node1 ->> Node2: message
Node2 ->> Node1: message
Note over Node1,Node2: Session established
rect rgb(100, 100, 0)
Note over Node2: ** Session expired **
end
rect rgb(10, 10, 10)
Note left of Node1: Node1 sends two requests<br> **in parallel**.
par
Node1 ->> Node2: FINDNODE (id:1)
and
Node1 -->> Node2: FINDNODE (id:2)
end
end
Note over Node2: Send WHOAREYOU<br>since the session has been expired
Node2 ->> Node1: WHOAREYOU (id:1)
rect rgb(100, 100, 0)
Note over Node2: Drop FINDNODE(id:2) request<br>since WHOAREYOU already sent.
end
Note over Node1: New session established with Node2
Node1 ->> Node2: Handshake message (id:1)
Note over Node2: New session established with Node1
Node2 ->> Node1: NODES (id:1)
rect rgb(100, 0, 0)
Note over Node1: ** FINDNODE(id:2) request timed out **
end
I observed that the
|
a766c21
to
b079c01
Compare
@ackintosh Thanks for the review! Made the suggested changes. Now, all active requests are replayed if an active session is re-established. Tested it out locally with testground and it appears to have solved the Btw, curious as to how you're finding testground? Looks like a pretty useful tool and something that we might want to use in trin. |
@njgheorghita Thanks! I will review the changes. Yeah testground is so helpful. It's easy to simulate/reproduce particular situation/problem with testground. We are also using testground for the gossipsub protocol https://github.com/sigp/gossipsub-testground. I hope this can be of reference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is progressing in the right direction but there are some bugs to be fixed
src/handler/active_requests.rs
Outdated
node_address: &NodeAddress, | ||
id: &RequestId, | ||
) -> Option<RequestCall> { | ||
match self.active_requests_mapping.entry(node_address.to_owned()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no difference between clone
and to_owned
for NodeContact
since it's owned version it of the same type. Let's use clone
which is easier to understand in general
match self.active_requests_mapping.entry(node_address.to_owned()) { | |
match self.active_requests_mapping.entry(node_address.clone()) { |
.remove(request_call.packet().message_nonce()); | ||
Some(request_call) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to check the len
of requests
. Otherwise you will end up with an entry with an empty list
src/handler/active_requests.rs
Outdated
Entry::Occupied(mut requests) => { | ||
match requests | ||
.get() | ||
.iter() | ||
.position(|req| req.packet().message_nonce() == &nonce) | ||
{ | ||
Some(index) => Poll::Ready(Some(Ok(( | ||
node_address, | ||
requests.get_mut().remove(index), | ||
)))), | ||
None => Poll::Ready(None), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, this will leave requests with empty lists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will leave requests with empty lists
Not sure I'm understanding your suggestion here. Do you want to see an empty requests replaced with Entry::Vacant
after removing the last request? Based on your suggestion on line 89, it seems like the convention is that it is possible for the active_requests_mapping
entry for a node to be an empty list instead of strictly Entry::Vacant
.
In the case where requests
is an empty list, this logic will simply return None => Poll::Ready(None)
. There doesn't seem to be a need to handle the empty list case any differently? But I could be missing something.
But if you'd like to see the change where the active_requests_mapping
entry for a node without any active requests cannot be an empty list and must always be Entry::Vacant
, I can make those changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Empty list should always be vacant. Otherwise we can let this grow unbounded on keys because we forgot to remove stale ones
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, on empty list use entry.remove()
src/handler/mod.rs
Outdated
if let Some(current_session) = self.sessions.get_mut(&node_address) { | ||
current_session.update(session); | ||
self.replay_active_requests::<P>(&node_address).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changes like this deserve a comment.
I need to think this one a bit more but the session stores the old keys in case they are needed, so I'm not sure this is necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment, but lmk if you feel like this needs to be removed. Although just to mention that this was the change that solved the FindNode
requests timeout error described above, but if this is unnecessary I'll debug that failure some more and find a more appropriate solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout has been solved by this change however I noticed another issue about this replaying:
As highlighted by the red square below, replay_active_requests()
re-sends the FINDNODE(id:1)
that has already been sent in the Handshake message. Active requests that should be re-sent is only FINDNODE(id:2)
in the case of this example.
sequenceDiagram
participant Node1
participant Node2
Note over Node1: Start discv5 server
Note over Node2: Start discv5 server
Node1 ->> Node2: message
Node2 ->> Node1: message
Note over Node1,Node2: Session established
rect rgb(100, 100, 0)
Note over Node2: ** Session expired **
end
rect rgb(10, 10, 10)
Note left of Node1: Node1 sends two requests<br> **in parallel**.
par
Node1 ->> Node2: FINDNODE (id:1)
and
Node1 -->> Node2: FINDNODE (id:2)
end
end
Note over Node2: Send WHOAREYOU<br>since the session has been expired
Node2 ->> Node1: WHOAREYOU (id:1)
rect rgb(100, 100, 0)
Note over Node2: Drop FINDNODE(id:2) request<br>since WHOAREYOU already sent.
end
Note over Node1: New session established with Node2
Node1 ->> Node2: Handshake message (id:1)
Note over Node2: New session established with Node1
rect rgb(10, 10, 10)
Note left of Node1: Handler::replay_active_requests()
rect rgb(100, 0, 0)
Node1 ->> Node2: FINDNODE (id:1) <br>**duplicates with Handshake message (id:1)**
end
Node1 -->> Node2: FINDNODE (id:2)
end
Node2 ->> Node1: NODES (id:1)
Node2 -->> Node1: NODES (id:2)
rect rgb(100, 0, 0)
Node2 ->> Node1: NODES (id:1)
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks for the diagram it was super helpful. I believe I've implemented a sufficient solution that filters out the handshake request when replaying active requests. But only in the case when Node2
sends the whoareyou
. I'm not sure if I took the right approach here, but lmk if you see anything that needs changing.
src/handler/mod.rs
Outdated
Some(Ok((node_address, _challenge))) = self.active_challenges.next() => { | ||
// A challenge has expired. There could be pending requests awaiting this | ||
// challenge. We process them here | ||
self.send_next_request::<P>(node_address).await; | ||
// A challenge has expired. There could be active requests impacted by this | ||
// challenge. We replay them here | ||
self.replay_active_requests::<P>(&node_address).await; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to send pending_requests here not active_requests.
Current implementation, requests in pending_requests remain without being sent in case of WHOAREYOU timeout:
sequenceDiagram
participant Node1
participant Node2
Node2 ->> Node1: Random packet
Note over Node1: active_challenges.insert()
Node1 ->> Node2: WHOAREYOU
rect rgb(100, 100, 0)
Note over Node2: ** Discard the WHOAREYOU packet<br>to reproduce kind of network issue. **
end
rect rgb(10, 10, 10)
Note left of Node1: Node1 want to send FINDNODE to Node2<br>but active_challenge exists.<br>So insert requests into pending_requests.
par
Note over Node1: pending_requests.insert()
and
Note over Node1: pending_requests.insert()
end
end
Note over Node1: The challenge in active_challenges<br>has been expired.
rect rgb(100, 0, 0)
Note over Node1: FINDNODE requests remain in pending_requests without being sent.
end
4c361d9
to
49dde30
Compare
Just a note that I've been experiencing some flaky behavior while trying to use this branch inside trin. Looks like there could be a problem with how requests are queued for a node before the session is established. Will update here once I've found a solution. |
@njgheorghita Could you please merge current master into this branch? There're some fixes. I'm not sure if the fixes relates to that problem though. |
@ackintosh Doesn't seem to have fixed the flakiness, I should have some time in the next day or two to further investigate what's going on there and I'll report back here |
@njgheorghita I'm working on adding a test for I don't have permission to push commits to this PR directly, so I have created a PR to your |
afe7466
to
76b88d9
Compare
After the fix has merged, I had run this on a live network and verified it works as expected. |
Ok, @ackintosh we think this ready for a merge? I'll do one last quick review if so. |
@AgeManning Yes, please go ahead with the final review. 🙂 |
This seems like it's been well tested. I think the best approach now is to merge it in. We're hoping to get better visibility on errors that occur is discovery, with #218 I think we should make a release with both of these PRs in and we will be able to test at scale in Lighthouse also. |
@divagant-martian - Want to do a final review, happy with a merge then live test? |
sure, I''l review this friday! |
As soon as the review is approved lets merge this :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm mostly happy with the current state. Looks like the happy paths are appropriately handled. There are some places where I'd like to do some changes: Active requests can probably be changed to use a Smallvec
instead of a Vec
with a reasonable low value. Some places need a bit of love in handling the unhappy paths since we do things like this:
fn operation_that_should_be_atomic() {
op1_that_works(); // this change remains even if op2 fails
op2_that_fails()?;
}
but in all fairness the affected code already required some love.
That being said, I don't want to delay this further. The base is good, now time to battle test it.
@AgeManning please note that there is one required changed before merging, since there is an unwrap in the code
src/handler/mod.rs
Outdated
{ | ||
let new_packet = session | ||
.encrypt_message::<P>(self.node_id, &request_call.encode()) | ||
.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I kinda believe this but we don't really use unwrap
in prod code (release builds). Can you please change this to handle the error or at least change it to an expect
with the right explanation?
This is my only requirement before merging @njgheorghita
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@njgheorghita - I had a quick look through the AES library we use. From what I can tell this function can fail if the size of the messages are over a specific limit. I was looking at version 0.9. I haven't looked at future versions.
I don't think it's reasonable to assume that in the current version and all possible future versions, that every possible message we try to encrypt will be successful. If there is an event that it's not successful, I dont think we should crash the entire application.
Failing to encrypt a message is not a catastrophic failure imo. I think it's fine to just log a CRIT message and continue running.
In which case we shouldn't use an unwrap()
or expect()
rather check the failure and if it fails log a crit.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some thoughts...
- We're only dealing with requests retrieved from the
active_requests
store. IIUC this implies a pretty strong guarantee that the encryption will be successful, since the request has already been successfully encrypted when it originated. If this is accurate, thenexpect
might actually expose an error in the request-handling logic - though, I'm unsure whether or not that is worth crashing... - If the encryption fails, do we want to do more than just log a CRIT message? Maybe it's worth removing the request entirely from the
active_requests
store. Although... maybe it's alright to just leave the (unable to encrypt) active request as is and let it simply timeout?
Failing to encrypt a message is not a catastrophic failure imo.
Ultimately, I agree with this. So I'm leaning towards logging a message. The question I would need to think more about (it's been a while since I've been intimate with this code 😅 ) is whether or not we need to handle un-encryptable requests? Or if it's ok to just log the message and let the request timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I more or less believed this could be true is because we are replaying. In that scenario, the appropriate message in expect
is why this is infallible, not simply the error occurred. Maybe could have been more clear about it.
But it seems to be replaying might no be enough guarantee from @AgeManning 's comment? In that case the error needs to be handled. Here that potentially means bobbling up the error to the application. You should look into how failed requests and failed encryption in particular is handled in other parts of the code to get a good idea of what to do in this scenario.
Also, yes to logging the error (there is no CRIT, just ERROR)
Or if it's ok to just log the message and let the request timeout
Would need to check but I assume we try to guarantee that if a request times out then it is active, removing from one place and not the other would most likely break that invariant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah all good points.
My personal preference with edge cases like this, is to just log the highest kind of error. Hopefully we have good enough monitoring tools to notice it. I tend not to try to engineer wild error cases. If we hit this, I think its fine to just let the request timeout. Usually we should always send an error back to the service, so the service knows that the request broke, but I'm not sure if its even worth doing that.
This should be a very very low probability of occurring. If things start breaking because we dont handle this error case correctly, we should hopefully be able to identify the problem from the error log.
Anyway, feel free to handle how you like, but I think just taking the easy path and logging an error should be enough.
@AgeManning @divagant-martian @ackintosh Thanks everyone so much for the help on this pr! I believe all the issues have been addressed & fixed, but if there are any other final changes that you'd like to see, just lmk! |
* Version bump to v0.3.1 (#203) * fix docs (#204) * refactor for removing `Discv5` prefix (#206) * goodbye prefix * adjust docs * fix rpc decoding to reject extra data (#208) * reject extra data * reduce diff * expected_responses remains after challenge has been completed (#210) * Replace `Handler::spawn` with `build_handler()` and `Handler::start()` * Test the handler's states after the handler has been terminated * Remove expected response on handle_auth_message() * Rename variables for readability * Expose local ENR Arc (#214) * Use zero ports in tests (#216) * update dependencies (#219) * Changing port of ReponseBody::Pong to NonZeroU16 (#220) * Change `port` from u16 to NonZeroU16 * Fix tests * Fix test: the PONG port can't be zero * Fix clippy warnings * Update 'enr' dependency (#223) * Add support for concurrent requests to a single peer. (#200) Co-authored-by: ackintosh <[email protected]> Co-authored-by: Diva M <[email protected]> Co-authored-by: Age Manning <[email protected]> * Adjust some logs (#225) * remove log for timed out query. This is always informed in the callback * expand common logs, unify info on startup * adjust auth header log * Update src/service.rs * Appease clippy * Realised I was wrong. Don't need this log, my bad * fmt --------- Co-authored-by: Age Manning <[email protected]> * Version bump to v0.4.0 * make tracing-subscriber a dev dep (#226) * Fix warnings and bump libp2p (#232) * Update session_cache_capacity from usize to NonZeroUsize since the argument of LruCache::new is NonZeroUsize. * Fix rustdoc * cargo fmt * Fix a merging mistake: lost validation in Message::decode --------- Co-authored-by: Age Manning <[email protected]> Co-authored-by: Divma <[email protected]> Co-authored-by: Jack McPherson <[email protected]> Co-authored-by: João Oliveira <[email protected]> Co-authored-by: Milos Stankovic <[email protected]> Co-authored-by: Nick Gheorghita <[email protected]> Co-authored-by: Diva M <[email protected]>
Description
Add support to discv5 for multiple, concurrent network requests to a single peer. See #198
How was it fixed?
ActiveRequests
to support tracking multiple, concurrent, active requests for any given peer.HashMapDelay
to the nonce mapping since it's no longer effective on the peer mapping.ActiveRequests
methods.Notes & open questions
active_requests
methods to return aResult
rather than anOption
. This made sense to me to accurately reflect the cases where the mappings might be out of sync. However, I realize that theHashMap
&HashMapDelay
apis return anOption
, so I don't feel strongly about this change either way.Change checklist