Fix search performance when creating a large number of channels #39
Conversation
✅ Build pvxs 1.0.853 completed (commit 48f43d1988 by @thomasives)
Hmm. I guess I can see how this happens. Originally, a
Have you looked to see how searches going through

As I interpret

Given that most of the overhead of handling UDP is per-packet, my goal for PVXS is to send fewer/larger search requests and responses. Of course, a trade-off needs to be made here between efficiency and latency. Also, implicitly, ease of use, since the current API doesn't have a way to express a batch of channel creations. ("good, fast, cheap" appears again...)
This would make a useful addition to https://github.com/mdavidsaver/cashark/pulls
I find myself wondering if there would be better batching by scheduling
I had wondered about this too, but I was worried about affecting the latency when creating a monitor, so I did not add a delay. As you say, if we try to improve batching by adding a delay here then there is a compromise to be made between how well we batch search requests and how much we increase the channel creation latency. I suppose I had decided (without much thought) that no increase to the latency was acceptable, but perhaps that isn't the best choice. Below is some data I have collected to help us choose a delay to add (if any). Just as a number to keep in mind, I have measured the latency of creating a monitor and receiving the first event (using monitor_latency.cpp in [1]) and on my machine I get:
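For reference, here is a minimal sketch of how such a create-to-first-event measurement can be taken. This is not the monitor_latency.cpp from the gist; the pvxs client calls (`Context::fromEnv()`, `monitor().event().exec()`) are assumptions based on the documented builder API, and the PV name is simply borrowed from the example discussed below.

```cpp
// Hedged sketch (not the gist's monitor_latency.cpp): time from creating a
// monitor to receiving its first event. The pvxs client calls are assumptions
// based on the documented builder API.
#include <chrono>
#include <future>
#include <iostream>
#include <mutex>

#include <pvxs/client.h>

int main() {
    using clock = std::chrono::steady_clock;
    auto ctxt = pvxs::client::Context::fromEnv();

    std::promise<clock::time_point> first;
    auto gotFirst = first.get_future();
    std::once_flag once;

    const auto start = clock::now();
    auto sub = ctxt.monitor("TEST:CALC00000")   // PV name taken from the thread
                   .event([&first, &once](pvxs::client::Subscription&) {
                       // record the arrival time of the first queued event only
                       std::call_once(once, [&first]{ first.set_value(clock::now()); });
                   })
                   .exec();

    const auto end = gotFirst.get();            // block until the first event arrives
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us from monitor creation to first event\n";
    return 0;
}
```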
Note that I am running the pvxs client on the same machine as the IOC, so this is a best-case scenario. In a more realistic situation there will be some network latency on top of this; I would guess something in the tens of milliseconds range, but I don't really know for sure. We can make a rough estimate of how long we have to delay for the
Just to be clear, this is how quickly the main thread adds items to the work queue. For our channel names (e.g. TEST:CALC00000) we can fit 71 channels in a search request. On my machine, it should take the application ~2 ms on average to add this many channels to the work queue, and ~8 ms in the worst case. There will also be some overhead from the

This simple analysis seems to agree roughly with what I observe with wireshark: here I have taken 4 captures at each delay (500 us, 1 ms, 5 ms, 10 ms, 100 ms) and plotted the channels per frame for each capture against the delay. The labels are the minimum and maximum number of frames for a given delay. For 100 ms we are optimal at only 15 frames (1000 = 14 * 71 + 6). At 10 ms (which is near our worst-case limit of ~8 ms), we are only occasionally not optimal, with just one of the four captures taking 16 frames. At a 5 ms delay (which is around our average time of ~2 ms) we are pretty close to optimal, with a spread of 15 - 19 frames. Even with a 500 us delay we are doing much better than with no delay, so perhaps even this is acceptable.

These are all small-sample-size measurements and we are using a pretty silly example program, so none of this should be taken too seriously. We also need to bear in mind that a Python application will be creating channels at a slower rate, so we would need a longer delay to get the same effect as we see here. Despite these shortcomings, I think a delay of 5 ms is probably an acceptable overhead to add to the latency of creating a channel, and it looks like it would help a lot with batching when creating a lot of channels at once, even for much slower Python applications.

@mdavidsaver What do you think? Would you compromise differently?

[1] https://gist.github.com/thomasives/1df7f1c668a465e8201819434e5b5112
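To make the back-of-envelope reasoning above easy to check, here is the same arithmetic as a tiny standalone program. The 71 channels per request and the ~2 ms / ~8 ms enqueue times are simply the figures quoted above, not new measurements.

```cpp
// Rough estimate of search-frame counts, using the figures quoted above.
#include <cstdio>

int main() {
    const int channels = 1000;          // channels created by the test program
    const int per_request = 71;         // names per search request for TEST:CALCnnnnn
    const double avg_fill_ms = 2.0;     // ~time to enqueue 71 channel creations (average)
    const double worst_fill_ms = 8.0;   // ~time to enqueue 71 channel creations (worst case)

    const int optimal_frames = (channels + per_request - 1) / per_request; // ceil(1000/71) = 15
    std::printf("optimal frames: %d (%d full + 1 of %d)\n",
                optimal_frames, channels / per_request, channels % per_request);

    // A fixed delay before the initial search batches well once it exceeds the
    // time needed to queue a full request's worth of channel creations.
    std::printf("delay should exceed ~%.0f ms (avg) / ~%.0f ms (worst) to fill a request\n",
                avg_fill_ms, worst_fill_ms);
    return 0;
}
```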
Just for completeness, I think the ideal would be some sort of adaptive delay where, if the application is creating a lot of channels, we wait in order to batch the search requests, whereas if it is only creating one or two we do not delay and prioritise latency. I think we could probably get what is effectively an adaptive delay if we had some priority work queue to schedule things with. However, I think it would increase the complexity of pvxs by a lot and there would be lots of edge cases to work out. I'm not sure this is really a route we want to go down, as it would be a lot of work and it is unclear how well it would work in practice.
I think if we want different use cases to perform optimally (even if we have some clever adaptive system), then we ultimately need to give more control to the application so they can tell us what they are trying to do. However, we might not need to do something as hard to use well as e.g. a
Based on this, I'm convinced that 100ms is excessive. I'd like to start with 10ms on the theory that your tests are done on a host which is more performant than average.
During the recent codeathon at DLS, @ralphlange and I spent some time troubleshooting an odd performance inversion he sees with libca where a large client takes hours to reconnect through cagateway. It turns out that libca searching attempts to adaptively throttle broadcast bandwidth by limiting the number of search packets sent based on the fraction of positive responses received recently. I see this as an example of how tricky "adaptive" can be. Also the fact that the searching algorithms of libca still have issues decades on is an indication that there is no optimal solution to this problem.
As future work, I'm thinking about adding a search policy selection when creating a client context and/or channel. The options would be "low latency" (the default) to put the newly created channel in

When talking about "connect 10000 PVs", I'm used to thinking about non-interactive clients like a data archiver system, where (historically) it was considered impolite to slam the network with so many UDP broadcasts all at once.
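Purely to make that idea concrete, here is a hypothetical sketch of what such a policy knob could look like. None of these names exist in pvxs today; they are invented for illustration only.

```cpp
// Hypothetical sketch only -- pvxs does not provide this API. Every name below
// is invented to make the "search policy" idea concrete.
#include <cstdio>

enum class SearchPolicy {
    LowLatency,   // search as soon as possible (the proposed default)
    Batched,      // delay the initial search so many creations share a packet
    Throttled,    // pace searches to limit broadcast bandwidth (archiver-style)
};

struct HypotheticalClientConfig {
    SearchPolicy defaultPolicy = SearchPolicy::LowLatency;
};

int main() {
    HypotheticalClientConfig cfg;
    cfg.defaultPolicy = SearchPolicy::Throttled;  // e.g. an archiver connecting 10000 PVs
    std::printf("policy = %d\n", static_cast<int>(cfg.defaultPolicy));
    return 0;
}
```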
…earch

When creating a large number of Channels at once, we can end up calling `ContextImpl::poke(true)` many times in quick succession. This results in a flood of UDP broadcasts in which we search again for channels whose initial search request we only just sent out. This can easily lead to packets getting lost and to us not receiving a reply for some Channels. Moreover, as we keep resending search requests for Channels, we reschedule them further and further into the future (as `nSearch` is increased). After the dust settles and we stop poking, this can result in a wait of several seconds before a Channel which we have not found is searched for again.

In this commit we avoid this issue by using a separate bucket to hold channels waiting for their initial search request. Rather than poking `tickSearch` to do the initial search and also resend requests for outstanding channels, we schedule a new call to `tickSearch` which will only send the initial search requests. As such, we avoid rebroadcasting search requests for channels we have only just searched for. We have promoted the `discover` bool to an enum to distinguish between the (now three) different situations in which `tickSearch` can be called.
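A schematic, heavily simplified model of the separation described in this commit message is sketched below. It is not the actual ContextImpl code; apart from `tickSearch`, `initialSearchBucket` and the promoted `discover` enum, every name is a placeholder.

```cpp
// Schematic model of the change described above; not the real pvxs internals.
#include <cstdio>
#include <string>
#include <vector>

// The former `discover` bool, promoted to an enum covering the three cases.
enum class SearchKind {
    Discover,   // periodic server discovery ping
    Initial,    // first search for channels created since the last tick
    Retry,      // re-send for channels that have not answered yet
};

struct ModelContext {
    std::vector<std::string> initialSearchBucket; // channels awaiting their first search
    std::vector<std::string> retryBucket;         // stand-in for the retry "search buckets" wheel

    void addChannel(const std::string& name) {
        // Creating a channel only queues it; it no longer triggers a
        // re-broadcast of every outstanding search.
        initialSearchBucket.push_back(name);
    }

    void tickSearch(SearchKind kind) {
        switch(kind) {
        case SearchKind::Initial:
            // One batched request for everything queued since the last tick,
            // after which the channels move to the retry wheel.
            std::printf("initial search for %zu channel(s)\n", initialSearchBucket.size());
            retryBucket.insert(retryBucket.end(),
                               initialSearchBucket.begin(), initialSearchBucket.end());
            initialSearchBucket.clear();
            break;
        case SearchKind::Retry:
            std::printf("retry search for %zu unanswered channel(s)\n", retryBucket.size());
            break;
        case SearchKind::Discover:
            std::printf("discovery ping\n");
            break;
        }
    }
};

int main() {
    ModelContext ctxt;
    for(int i = 0; i < 5; i++)
        ctxt.addChannel("TEST:CALC0000" + std::to_string(i));
    ctxt.tickSearch(SearchKind::Initial); // one batched request instead of five
    return 0;
}
```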
… queue

By using `tcp_loop.dispatch` to schedule the initial search for a channel, we are placing the callback into the same work queue that is used by e.g. `MonitorBuilder::exec` to schedule the call to `Channel::build`. In situations where lots of channels are being created simultaneously, this can result in lots of single-channel search requests being sent, because the work queue alternates between calls to build a channel and the initial search.

In this commit we instead use a dedicated `evevent` to schedule the initial search, which allows the `initialSearchBucket` to be filled before we send the initial search request. We delay the initial search by 10 ms to give more time for the bucket to be filled. See https://github.com/epics-base/pvxs/pull/39 for a discussion of how this delay was chosen.
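The scheduling idea can be illustrated with a minimal, self-contained toy written directly against libevent (which pvxs wraps). This is only a sketch of the coalescing pattern, not the PR's actual `evevent` usage; the Batcher type and its members are invented for illustration.

```cpp
// Toy illustration: coalesce channel creations behind a one-shot 10 ms timer,
// built directly on libevent. Not the PR's actual code, just the idea.
#include <event2/event.h>
#include <cstdio>
#include <string>
#include <vector>

struct Batcher {
    event_base* base;
    event* timer = nullptr;
    std::vector<std::string> pending;   // stands in for initialSearchBucket

    explicit Batcher(event_base* b) : base(b) {
        timer = evtimer_new(base, &Batcher::fire, this);
    }
    ~Batcher() { event_free(timer); }

    void addChannel(const std::string& name) {
        pending.push_back(name);
        if(!evtimer_pending(timer, nullptr)) {
            // Arm the one-shot timer only if it isn't already pending, so many
            // creations within the window share a single "initial search".
            timeval delay{0, 10000}; // 10 ms
            evtimer_add(timer, &delay);
        }
    }

    static void fire(evutil_socket_t, short, void* raw) {
        auto self = static_cast<Batcher*>(raw);
        std::printf("initial search for %zu channel(s)\n", self->pending.size());
        self->pending.clear();
    }
};

int main() {
    event_base* base = event_base_new();
    {
        Batcher batcher(base);
        for(int i = 0; i < 100; i++)
            batcher.addChannel("TEST:CALC" + std::to_string(i));
        event_base_dispatch(base);  // runs until the timer fires once
    }
    event_base_free(base);
    return 0;
}
```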
Sorry I accidentally closed the PR when I tried to rebase using the github interface. I did not realise the button I pressed would do that. I have added a 10 ms delay to the initial search as you suggested.
Indeed, this is a hard problem to implement correctly.
See epics-base/epics-base#372 for the search performance issue seen on CA that @mdavidsaver mentioned above.
✅ Build pvxs 1.0.866 completed (commit fa6b52c1f7 by @thomasives)
It looks like you were able to straighten this out. Although for some reason I had to re-enable the GHA runs. Last I checked I only had to do this once per account, but then github.com is a moving target. Anyway, this change looks ok to me.
✅ Build pvxs 1.0.867 completed (commit cf47f82b21 by @thomasives)
I have been looking at the test failures on OS X and I am struggling to understand how my changes could have caused this. Unfortunately, I don't have access to an OS X machine, so it is hard for me to investigate. The test that is failing is

Regardless of why this is happening, I don't see how my changes could cause it, as this is testing the evhelper classes, which are at a lower level than my changes. @mdavidsaver any ideas?
The issue with
In
Fixing

I've been working with the changes in this PR, which so far work as expected. I found one mistake I made before, which this PR perpetuates, where a discovery
Merged as of fe5a35e. Thanks!
When creating a large number of channels (~1000) in a short amount of time, pvxs can end up sending several search requests for each channel without giving the IOC a reasonable amount of time to respond. This results in search requests going missing, delaying how long it takes for all the channels to be found. The issue is that when we send an initial search for a channel we also send search requests for channels we are still waiting on. This PR fixes this by separating the sending of initial search requests from the re-sending of search requests that have not yet been answered. More details in the commit messages.
To demonstrate the issue I am using an example program which creates 1000 monitors to PVs provided by the example_ioc from ajgdls/EPICSPyClientPerformance.git. When investigating the performance on my machine using cashark [1], I get the following when using the master branch:

We are, on average, sending ~12 search requests for each PV, and it takes ~8 s to finish finding them all. With this PR I get the following in the same situation:
With the PR, we only send a single search request for each PV, and we have reduced the time it takes to finish finding all the PVs by a factor of ~250 compared to the master branch. On my system these pcaps are representative of what happens for each version.
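For context, here is a rough stand-in for the kind of test client described above. The actual program is the EPICSPyClientPerformance-based setup mentioned in the description; the pvxs client calls below are assumptions based on the documented builder API, not the author's code.

```cpp
// Hedged sketch of a client creating ~1000 monitors at once, so the batching of
// initial search requests can be observed in wireshark/cashark.
#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>
#include <vector>

#include <pvxs/client.h>

int main() {
    auto ctxt = pvxs::client::Context::fromEnv();

    std::vector<std::shared_ptr<pvxs::client::Subscription>> subs;
    subs.reserve(1000);

    char name[32];
    for(unsigned i = 0; i < 1000; i++) {
        std::snprintf(name, sizeof(name), "TEST:CALC%05u", i);
        // Each exec() queues a channel creation; with this PR the initial
        // searches for all of these should be packed into a few UDP frames.
        subs.push_back(ctxt.monitor(name)
                           .event([](pvxs::client::Subscription&) {
                               // updates would be pop()'d here in a real client
                           })
                           .exec());
    }

    // keep the process alive long enough to observe the searches on the wire
    std::this_thread::sleep_for(std::chrono::seconds(30));
    return 0;
}
```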
@mdavidsaver Please let me know if you can think of a better way of fixing this. I'm happy to iterate on this if needs be.
Ping @ajgdls @coretl.
[1] With a small patch to add a `pva.count` field.