Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for concurrent requests to a single peer. #200

Merged
merged 46 commits into from
Dec 10, 2023

Conversation

njgheorghita
Copy link
Contributor

@njgheorghita njgheorghita commented Jul 10, 2023

Description

Add support to discv5 for multiple, concurrent network requests to a single peer. See #198

How was it fixed?

  • Update ActiveRequests to support tracking multiple, concurrent, active requests for any given peer.
  • Moved HashMapDelay to the nonce mapping since it's no longer effective on the peer mapping.
  • If any single request fails for a peer, all other active requests for that peer are also failed.
  • Replay all active requests if a new challenge is issued.
  • Added tests for ActiveRequests methods.

Notes & open questions

  • Updated active_requests methods to return a Result rather than an Option. This made sense to me to accurately reflect the cases where the mappings might be out of sync. However, I realize that the HashMap & HashMapDelay apis return an Option, so I don't feel strongly about this change either way.
  • This was tested in trin, and resulted in 100% success rate for utp txs over discv5.

Change checklist

  • Self-review
  • Documentation updates if relevant
  • Tests if relevant

@njgheorghita njgheorghita marked this pull request as ready for review July 10, 2023 20:25
@divagant-martian divagant-martian self-requested a review July 18, 2023 23:20
src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/mod.rs Outdated Show resolved Hide resolved
src/handler/mod.rs Outdated Show resolved Hide resolved
src/handler/mod.rs Outdated Show resolved Hide resolved
@njgheorghita njgheorghita force-pushed the experimental branch 5 times, most recently from 15b5ed7 to 9cc6c79 Compare July 19, 2023 21:16
@njgheorghita
Copy link
Contributor Author

Thanks @divagant-martian for the review! I wasn't too familiar with using entry() so thanks for pointing out where that would be useful.

  • I went back to the old pattern of returning an Option rather than a Result from ActiveRequests::remove_* methods. Imo it's just cleaner if we want to use entry() to mutate ActiveRequests. As well as it more closely represents the HashMap api. This means that the code functionally ignores scenarios when the ActiveRequests mappings are out of sync (besides logging some info). However, this was the behavior before this pr, so I don't consider it a problem.
  • I re-introduced pending_requests to queue requests that are made before a session is established

@njgheorghita njgheorghita force-pushed the experimental branch 4 times, most recently from 15afbcc to 7ae82df Compare July 20, 2023 14:14
Copy link
Collaborator

@divagant-martian divagant-martian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review for now

src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
Comment on lines 51 to 52
.position(|req| req.packet().message_nonce() == nonce)
.expect("to find request call by nonce");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back when I suggested to use a expect the case was first finding the position and then a get. In that case it was ok since we assume the standard library ensures us position is correct.

Here however, this means the RequestCall matching the nonce is always present, which it should if we don't have bugs. Since we are treating our (potential) bugs as None this shouldn't be expectd, but treated instead exactly as the Vacant variant

src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Show resolved Hide resolved
src/handler/active_requests.rs Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
src/handler/active_requests.rs Outdated Show resolved Hide resolved
Comment on lines 140 to 146
Entry::Vacant(_) => panic!("invalid ActiveRequests state"),
Entry::Occupied(mut requests) => {
let index = requests
.get()
.iter()
.position(|req| req.packet().message_nonce() == &nonce)
.expect("to find request call by nonce");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to no panic in those to keep in line with gracefully handling potential inconsistencies

@ackintosh
Copy link
Member

I'm testing this PR with a simulation. 🙂 The sequence diagram below shows what happens on session expirations.

sequenceDiagram
    participant Node1
    participant Node2
    Note over Node1: Start discv5 server
    Note over Node2: Start discv5 server

    Node1 ->> Node2: message
    Node2 ->> Node1: message
    Note over Node1,Node2: Session established

    rect rgb(100, 100, 0)
    Note over Node2: ** Session expired **
    end

    rect rgb(10, 10, 10)
    Note left of Node1: Node1 sends two requests<br> **in parallel**.
    par 
    Node1 ->> Node2: FINDNODE (id:1)
    and 
    Node1 -->> Node2: FINDNODE (id:2)
    end
    end

    Note over Node2: Send WHOAREYOU<br>since the session has been expired
    Node2 ->> Node1: WHOAREYOU (id:1)

    rect rgb(100, 100, 0)
    Note over Node2: Drop FINDNODE(id:2) request<br>since WHOAREYOU already sent.
    end

    Note over Node1: New session established with Node2

    Node1 ->> Node2: Handshake message (id:1)

    Note over Node2: New session established with Node1

    Node2 ->> Node1: NODES (id:1)

    rect rgb(100, 0, 0)
    Note over Node1: ** FINDNODE(id:2) request timed out **
    end
Loading

I observed that the FINDNODE (id:2) request had been dropped and then the request timed out. As fjl said, I think the FINDNODE (id:2) request should be replayed with the new session.

#198 (comment)

Remember that in discv5, it is assumed the session may be dropped at any moment. If the node has multiple requests in flight with a peer, and the peer answers with a challenge (WHOAREYOU), all pending requests would need to be replayed.

src/handler/mod.rs Outdated Show resolved Hide resolved
src/handler/mod.rs Outdated Show resolved Hide resolved
@njgheorghita
Copy link
Contributor Author

@ackintosh Thanks for the review! Made the suggested changes. Now, all active requests are replayed if an active session is re-established. Tested it out locally with testground and it appears to have solved the FINDNODE request timeout. Lmk if there are any other changes / testground simulations to run.

Btw, curious as to how you're finding testground? Looks like a pretty useful tool and something that we might want to use in trin.

@ackintosh
Copy link
Member

ackintosh commented Aug 3, 2023

@njgheorghita Thanks! I will review the changes.

Yeah testground is so helpful. It's easy to simulate/reproduce particular situation/problem with testground. We are also using testground for the gossipsub protocol https://github.com/sigp/gossipsub-testground. I hope this can be of reference.

src/handler/mod.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@divagant-martian divagant-martian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is progressing in the right direction but there are some bugs to be fixed

node_address: &NodeAddress,
id: &RequestId,
) -> Option<RequestCall> {
match self.active_requests_mapping.entry(node_address.to_owned()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no difference between clone and to_owned for NodeContact since it's owned version it of the same type. Let's use clone which is easier to understand in general

Suggested change
match self.active_requests_mapping.entry(node_address.to_owned()) {
match self.active_requests_mapping.entry(node_address.clone()) {

Comment on lines +89 to +138
.remove(request_call.packet().message_nonce());
Some(request_call)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to check the len of requests. Otherwise you will end up with an entry with an empty list

Comment on lines 127 to 189
Entry::Occupied(mut requests) => {
match requests
.get()
.iter()
.position(|req| req.packet().message_nonce() == &nonce)
{
Some(index) => Poll::Ready(Some(Ok((
node_address,
requests.get_mut().remove(index),
)))),
None => Poll::Ready(None),
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, this will leave requests with empty lists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will leave requests with empty lists

Not sure I'm understanding your suggestion here. Do you want to see an empty requests replaced with Entry::Vacant after removing the last request? Based on your suggestion on line 89, it seems like the convention is that it is possible for the active_requests_mapping entry for a node to be an empty list instead of strictly Entry::Vacant.

In the case where requests is an empty list, this logic will simply return None => Poll::Ready(None). There doesn't seem to be a need to handle the empty list case any differently? But I could be missing something.

But if you'd like to see the change where the active_requests_mapping entry for a node without any active requests cannot be an empty list and must always be Entry::Vacant, I can make those changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Empty list should always be vacant. Otherwise we can let this grow unbounded on keys because we forgot to remove stale ones

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, on empty list use entry.remove()

if let Some(current_session) = self.sessions.get_mut(&node_address) {
current_session.update(session);
self.replay_active_requests::<P>(&node_address).await;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes like this deserve a comment.
I need to think this one a bit more but the session stores the old keys in case they are needed, so I'm not sure this is necessary

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment, but lmk if you feel like this needs to be removed. Although just to mention that this was the change that solved the FindNode requests timeout error described above, but if this is unnecessary I'll debug that failure some more and find a more appropriate solution

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout has been solved by this change however I noticed another issue about this replaying:

As highlighted by the red square below, replay_active_requests() re-sends the FINDNODE(id:1) that has already been sent in the Handshake message. Active requests that should be re-sent is only FINDNODE(id:2) in the case of this example.

sequenceDiagram
    participant Node1
    participant Node2
    Note over Node1: Start discv5 server
    Note over Node2: Start discv5 server

    Node1 ->> Node2: message
    Node2 ->> Node1: message
    Note over Node1,Node2: Session established

    rect rgb(100, 100, 0)
    Note over Node2: ** Session expired **
    end

    rect rgb(10, 10, 10)
    Note left of Node1: Node1 sends two requests<br> **in parallel**.
    par 
    Node1 ->> Node2: FINDNODE (id:1)
    and 
    Node1 -->> Node2: FINDNODE (id:2)
    end
    end

    Note over Node2: Send WHOAREYOU<br>since the session has been expired
    Node2 ->> Node1: WHOAREYOU (id:1)

    rect rgb(100, 100, 0)
    Note over Node2: Drop FINDNODE(id:2) request<br>since WHOAREYOU already sent.
    end

    Note over Node1: New session established with Node2

    Node1 ->> Node2: Handshake message (id:1)

    Note over Node2: New session established with Node1

    rect rgb(10, 10, 10)
    Note left of Node1: Handler::replay_active_requests()
    rect rgb(100, 0, 0)
    Node1 ->> Node2: FINDNODE (id:1) <br>**duplicates with Handshake message (id:1)**
    end
    Node1 -->> Node2: FINDNODE (id:2)
    end

    Node2 ->> Node1: NODES (id:1)
    Node2 -->> Node1: NODES (id:2)

    rect rgb(100, 0, 0)
    Node2 ->> Node1: NODES (id:1)
    end
Loading

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, thanks for the diagram it was super helpful. I believe I've implemented a sufficient solution that filters out the handshake request when replaying active requests. But only in the case when Node2 sends the whoareyou. I'm not sure if I took the right approach here, but lmk if you see anything that needs changing.

src/handler/mod.rs Outdated Show resolved Hide resolved
Comment on lines 346 to 349
Some(Ok((node_address, _challenge))) = self.active_challenges.next() => {
// A challenge has expired. There could be pending requests awaiting this
// challenge. We process them here
self.send_next_request::<P>(node_address).await;
// A challenge has expired. There could be active requests impacted by this
// challenge. We replay them here
self.replay_active_requests::<P>(&node_address).await;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to send pending_requests here not active_requests.

Current implementation, requests in pending_requests remain without being sent in case of WHOAREYOU timeout:

sequenceDiagram
    participant Node1
    participant Node2


    Node2 ->> Node1: Random packet
    Note over Node1: active_challenges.insert()
    Node1 ->> Node2: WHOAREYOU
    rect rgb(100, 100, 0)
    Note over Node2: ** Discard the WHOAREYOU packet<br>to reproduce kind of network issue. **
    end

    rect rgb(10, 10, 10)
    Note left of Node1: Node1 want to send FINDNODE to Node2<br>but active_challenge exists.<br>So insert requests into pending_requests.
    par 
    Note over Node1: pending_requests.insert()
    and 
    Note over Node1: pending_requests.insert()
    end
    end

    Note over Node1: The challenge in active_challenges<br>has been expired.

    rect rgb(100, 0, 0)
    Note over Node1: FINDNODE requests remain in pending_requests without being sent.
    end
Loading

@njgheorghita
Copy link
Contributor Author

Just a note that I've been experiencing some flaky behavior while trying to use this branch inside trin. Looks like there could be a problem with how requests are queued for a node before the session is established. Will update here once I've found a solution.

@ackintosh
Copy link
Member

@njgheorghita Could you please merge current master into this branch? There're some fixes. I'm not sure if the fixes relates to that problem though.

@njgheorghita
Copy link
Contributor Author

@ackintosh Doesn't seem to have fixed the flakiness, I should have some time in the next day or two to further investigate what's going on there and I'll report back here

@ackintosh
Copy link
Member

@njgheorghita I'm working on adding a test for replay_active_request() and trying to fix bugs I found through the test.

I don't have permission to push commits to this PR directly, so I have created a PR to your experimental branch here: njgheorghita#2

@ackintosh
Copy link
Member

After the fix has merged, I had run this on a live network and verified it works as expected.

@AgeManning
Copy link
Member

Ok, @ackintosh we think this ready for a merge?

I'll do one last quick review if so.

@ackintosh
Copy link
Member

@AgeManning Yes, please go ahead with the final review. 🙂

@AgeManning
Copy link
Member

This seems like it's been well tested. I think the best approach now is to merge it in. We're hoping to get better visibility on errors that occur is discovery, with #218

I think we should make a release with both of these PRs in and we will be able to test at scale in Lighthouse also.

@AgeManning
Copy link
Member

@divagant-martian - Want to do a final review, happy with a merge then live test?

@divagant-martian
Copy link
Collaborator

sure, I''l review this friday!

@AgeManning
Copy link
Member

As soon as the review is approved lets merge this :)

Copy link
Collaborator

@divagant-martian divagant-martian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly happy with the current state. Looks like the happy paths are appropriately handled. There are some places where I'd like to do some changes: Active requests can probably be changed to use a Smallvec instead of a Vec with a reasonable low value. Some places need a bit of love in handling the unhappy paths since we do things like this:

fn operation_that_should_be_atomic() {
    op1_that_works(); // this change remains even if op2 fails
    op2_that_fails()?;
}

but in all fairness the affected code already required some love.

That being said, I don't want to delay this further. The base is good, now time to battle test it.

@AgeManning please note that there is one required changed before merging, since there is an unwrap in the code

{
let new_packet = session
.encrypt_message::<P>(self.node_id, &request_call.encode())
.unwrap();
Copy link
Collaborator

@divagant-martian divagant-martian Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kinda believe this but we don't really use unwrap in prod code (release builds). Can you please change this to handle the error or at least change it to an expect with the right explanation?

This is my only requirement before merging @njgheorghita

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@njgheorghita - I had a quick look through the AES library we use. From what I can tell this function can fail if the size of the messages are over a specific limit. I was looking at version 0.9. I haven't looked at future versions.

I don't think it's reasonable to assume that in the current version and all possible future versions, that every possible message we try to encrypt will be successful. If there is an event that it's not successful, I dont think we should crash the entire application.

Failing to encrypt a message is not a catastrophic failure imo. I think it's fine to just log a CRIT message and continue running.

In which case we shouldn't use an unwrap() or expect() rather check the failure and if it fails log a crit.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts...

  • We're only dealing with requests retrieved from the active_requests store. IIUC this implies a pretty strong guarantee that the encryption will be successful, since the request has already been successfully encrypted when it originated. If this is accurate, then expect might actually expose an error in the request-handling logic - though, I'm unsure whether or not that is worth crashing...
  • If the encryption fails, do we want to do more than just log a CRIT message? Maybe it's worth removing the request entirely from the active_requests store. Although... maybe it's alright to just leave the (unable to encrypt) active request as is and let it simply timeout?

Failing to encrypt a message is not a catastrophic failure imo.

Ultimately, I agree with this. So I'm leaning towards logging a message. The question I would need to think more about (it's been a while since I've been intimate with this code 😅 ) is whether or not we need to handle un-encryptable requests? Or if it's ok to just log the message and let the request timeout

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason why I more or less believed this could be true is because we are replaying. In that scenario, the appropriate message in expect is why this is infallible, not simply the error occurred. Maybe could have been more clear about it.

But it seems to be replaying might no be enough guarantee from @AgeManning 's comment? In that case the error needs to be handled. Here that potentially means bobbling up the error to the application. You should look into how failed requests and failed encryption in particular is handled in other parts of the code to get a good idea of what to do in this scenario.

Also, yes to logging the error (there is no CRIT, just ERROR)

Or if it's ok to just log the message and let the request timeout

Would need to check but I assume we try to guarantee that if a request times out then it is active, removing from one place and not the other would most likely break that invariant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah all good points.

My personal preference with edge cases like this, is to just log the highest kind of error. Hopefully we have good enough monitoring tools to notice it. I tend not to try to engineer wild error cases. If we hit this, I think its fine to just let the request timeout. Usually we should always send an error back to the service, so the service knows that the request broke, but I'm not sure if its even worth doing that.
This should be a very very low probability of occurring. If things start breaking because we dont handle this error case correctly, we should hopefully be able to identify the problem from the error log.

Anyway, feel free to handle how you like, but I think just taking the easy path and logging an error should be enough.

@njgheorghita
Copy link
Contributor Author

@AgeManning @divagant-martian @ackintosh Thanks everyone so much for the help on this pr! I believe all the issues have been addressed & fixed, but if there are any other final changes that you'd like to see, just lmk!

@divagant-martian divagant-martian merged commit 546e19c into sigp:master Dec 10, 2023
6 checks passed
@ackintosh
Copy link
Member

https://github.com/sigp/discv5/releases/tag/v0.4.0 🎉

AgeManning added a commit that referenced this pull request Feb 21, 2024
* Version bump to v0.3.1 (#203)

* fix docs (#204)

* refactor for removing `Discv5` prefix (#206)

* goodbye prefix

* adjust docs

* fix rpc decoding to reject extra data (#208)

* reject extra data

* reduce diff

* expected_responses remains after challenge has been completed (#210)

* Replace `Handler::spawn` with `build_handler()` and `Handler::start()`

* Test the handler's states after the handler has been terminated

* Remove expected response on handle_auth_message()

* Rename variables for readability

* Expose local ENR Arc (#214)

* Use zero ports in tests (#216)

* update dependencies (#219)

* Changing port of ReponseBody::Pong to NonZeroU16 (#220)

* Change `port` from u16 to NonZeroU16

* Fix tests

* Fix test: the PONG port can't be zero

* Fix clippy warnings

* Update 'enr' dependency (#223)

* Add support for concurrent requests to a single peer. (#200)

Co-authored-by: ackintosh <[email protected]>
Co-authored-by: Diva M <[email protected]>
Co-authored-by: Age Manning <[email protected]>

* Adjust some logs (#225)

* remove log for timed out query. This is always informed in the callback

* expand common logs, unify info on startup

* adjust auth header log

* Update src/service.rs

* Appease clippy

* Realised I was wrong. Don't need this log, my bad

* fmt

---------

Co-authored-by: Age Manning <[email protected]>

* Version bump to v0.4.0

* make tracing-subscriber a dev dep (#226)

* Fix warnings and bump libp2p (#232)

* Update session_cache_capacity from usize to NonZeroUsize

since the argument of LruCache::new is NonZeroUsize.

* Fix rustdoc

* cargo fmt

* Fix a merging mistake: lost validation in Message::decode

---------

Co-authored-by: Age Manning <[email protected]>
Co-authored-by: Divma <[email protected]>
Co-authored-by: Jack McPherson <[email protected]>
Co-authored-by: João Oliveira <[email protected]>
Co-authored-by: Milos Stankovic <[email protected]>
Co-authored-by: Nick Gheorghita <[email protected]>
Co-authored-by: Diva M <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants