Fix search performance when creating a large number of channels #39
Conversation
✅ Build pvxs 1.0.853 completed (commit 48f43d1988 by @thomasives)
Hmm. I guess I can see how this happens. Originally, a
Have you looked to see how searches going through

As I interpret

Given that most of the overhead of handling UDP is per-packet, my goal for PVXS is to send fewer/larger search requests and responses. Of course, a trade-off needs to be made here between efficiency and latency. Also, implicitly, ease of use, since the current API doesn't have a way to express a batch of channel creations. ("good, fast, cheap" appears again...)
This would make a useful addition to https://github.com/mdavidsaver/cashark/pulls
I find myself wondering if there would be better batching by scheduling
I had wondered about this too, but I was worried about affecting the latency when creating a monitor, so I did not add a delay. As you say, if we try to improve batching by adding a delay here then there is a compromise to be made between how well we batch search requests and how much we increase the channel creation latency. I suppose I had decided (without much thought) that no increase to the latency was acceptable, but perhaps that isn't the best choice. Below is some data I have collected to help us choose a delay to add (if any). Just as a number to keep in mind, I have measured the latency of creating a monitor and receiving the first event (using monitor_latency.cpp in [1]) and on my machine I get:
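For reference, here is a minimal sketch of how such a create-to-first-event measurement can be taken. This is not the monitor_latency.cpp from the gist; the pvxs client calls (`Context::fromEnv()`, `monitor().event().exec()`) are assumptions based on the documented builder API, and the PV name is simply borrowed from the example discussed below.

```cpp
// Hedged sketch (not the gist's monitor_latency.cpp): time from creating a
// monitor to receiving its first event. The pvxs client calls are assumptions
// based on the documented builder API.
#include <chrono>
#include <future>
#include <iostream>
#include <mutex>

#include <pvxs/client.h>

int main() {
    using clock = std::chrono::steady_clock;
    auto ctxt = pvxs::client::Context::fromEnv();

    std::promise<clock::time_point> first;
    auto gotFirst = first.get_future();
    std::once_flag once;

    const auto start = clock::now();
    auto sub = ctxt.monitor("TEST:CALC00000")   // PV name taken from the thread
                   .event([&first, &once](pvxs::client::Subscription&) {
                       // record the arrival time of the first queued event only
                       std::call_once(once, [&first]{ first.set_value(clock::now()); });
                   })
                   .exec();

    const auto end = gotFirst.get();            // block until the first event arrives
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
              << " us from monitor creation to first event\n";
    return 0;
}
```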
Note that I am running the pvxs client on the same machine as the IOC, so this is a best-case scenario. In a more realistic situation there will be some network latency on top of this; I would guess something in the tens of milliseconds range, but I don't really know for sure. We can make a rough estimate of how long we have to delay for the
Just to be clear, this is how quickly the main thread adds items to the work queue. For our channel names (e.g. TEST:CALC00000) we can fit 71 channels in a search request. On my machine, it should take the application ~2 ms on average to add this many channels to the work queue, and ~8 ms in the worst case. There will also be some overhead from the

This simple analysis seems to agree roughly with what I observe with wireshark: here I have taken 4 captures at each delay (500 us, 1 ms, 5 ms, 10 ms, 100 ms) and plotted the channels per frame for each capture against the delay. The labels are the minimum and maximum number of frames for a given delay. For 100 ms we are optimal at only 15 frames (1000 = 14 * 71 + 6). At 10 ms (which is near our worst-case limit of ~8 ms), we are only occasionally not optimal, with just one of the four captures taking 16 frames. At a 5 ms delay (which is around our average time of ~2 ms) we are pretty close to optimal, with a spread of 15 - 19 frames. Even with a 500 us delay we are doing much better than with no delay, so perhaps even this is acceptable.

These are all small-sample-size measurements and we are using a pretty silly example program, so none of this should be taken too seriously. We also need to bear in mind that a Python application will be creating channels at a slower rate, so we would need a longer delay to get the same effect as we see here. Despite these shortcomings, I think a delay of 5 ms is probably an acceptable overhead to add to the latency of creating a channel, and it looks like it would help a lot with batching when creating a lot of channels at once, even for much slower Python applications.

@mdavidsaver What do you think? Would you compromise differently?

[1] https://gist.github.com/thomasives/1df7f1c668a465e8201819434e5b5112
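To make the back-of-envelope reasoning above easy to check, here is the same arithmetic as a tiny standalone program. The 71 channels per request and the ~2 ms / ~8 ms enqueue times are simply the figures quoted above, not new measurements.

```cpp
// Rough estimate of search-frame counts, using the figures quoted above.
#include <cstdio>

int main() {
    const int channels = 1000;          // channels created by the test program
    const int per_request = 71;         // names per search request for TEST:CALCnnnnn
    const double avg_fill_ms = 2.0;     // ~time to enqueue 71 channel creations (average)
    const double worst_fill_ms = 8.0;   // ~time to enqueue 71 channel creations (worst case)

    const int optimal_frames = (channels + per_request - 1) / per_request; // ceil(1000/71) = 15
    std::printf("optimal frames: %d (%d full + 1 of %d)\n",
                optimal_frames, channels / per_request, channels % per_request);

    // A fixed delay before the initial search batches well once it exceeds the
    // time needed to queue a full request's worth of channel creations.
    std::printf("delay should exceed ~%.0f ms (avg) / ~%.0f ms (worst) to fill a request\n",
                avg_fill_ms, worst_fill_ms);
    return 0;
}
```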
Just for completeness, I think the ideal would be some sort of adaptive delay where, if the application is creating a lot of channels, we wait in order to batch the search requests, whereas if it is only creating one or two we do not delay and prioritise latency. I think we could probably get what is effectively an adaptive delay if we had some priority work queue to schedule things with. However, I think it would increase the complexity of pvxs by a lot and there would be lots of edge cases to work out. I'm not sure this is really a route we want to go down, as it would be a lot of work and it is unclear how well it would work in practice.
I think if we want different use cases to perform optimally (even if we have some clever adaptive system), then we ultimately need to give more control to the application so they can tell us what they are trying to do. However, we might not need to do something as hard to use well as e.g. a
Based on this, I'm convinced that 100ms is excessive. I'd like to start with 10ms on the theory that your tests are done on a host which is more performant than average.
During the recent codeathon at DLS, @ralphlange and I spent some time troubleshooting an odd performance inversion he sees with libca where a large client takes hours to reconnect through cagateway. It turns out that libca searching attempts to adaptively throttle broadcast bandwidth by limiting the number of search packets sent based on the fraction of positive responses received recently. I see this as an example of how tricky "adaptive" can be. Also the fact that the searching algorithms of libca still have issues decades on is an indication that there is no optimal solution to this problem.
As future work, I'm thinking about adding a search policy selection when creating a client context and/or channel. The options would be "low latency" (the default) to put the newly created channel in

When talking about "connect 10000 PVs", I'm used to thinking about non-interactive clients like a data archiver system, where (historically) it was considered impolite to slam the network with so many UDP broadcasts all at once.
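Purely to make that idea concrete, here is a hypothetical sketch of what such a policy knob could look like. None of these names exist in pvxs today; they are invented for illustration only.

```cpp
// Hypothetical sketch only -- pvxs does not provide this API. Every name below
// is invented to make the "search policy" idea concrete.
#include <cstdio>

enum class SearchPolicy {
    LowLatency,   // search as soon as possible (the proposed default)
    Batched,      // delay the initial search so many creations share a packet
    Throttled,    // pace searches to limit broadcast bandwidth (archiver-style)
};

struct HypotheticalClientConfig {
    SearchPolicy defaultPolicy = SearchPolicy::LowLatency;
};

int main() {
    HypotheticalClientConfig cfg;
    cfg.defaultPolicy = SearchPolicy::Throttled;  // e.g. an archiver connecting 10000 PVs
    std::printf("policy = %d\n", static_cast<int>(cfg.defaultPolicy));
    return 0;
}
```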
…earch

When creating a large number of Channels at once, we can end up calling `ContextImpl::poke(true)` many times in quick succession. This results in a flood of UDP broadcasts in which we search again for channels whose initial search request we only just sent out. This can easily lead to packets getting lost and to us not receiving a reply for some Channels. Moreover, as we keep resending search requests for Channels, we reschedule them further and further into the future (as `nSearch` is increased). After the dust settles and we stop poking, this can result in a wait of several seconds before a Channel which we have not found is searched for again.

In this commit we avoid this issue by using a separate bucket to hold channels waiting for their initial search request. Rather than poking `tickSearch` to do the initial search and also resend requests for outstanding channels, we schedule a new call to `tickSearch` which will only send the initial search requests. As such, we avoid rebroadcasting search requests for channels we have only just searched for. We have promoted the `discover` bool to an enum to distinguish between the (now three) different situations in which `tickSearch` can be called.
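A schematic, heavily simplified model of the separation described in this commit message is sketched below. It is not the actual ContextImpl code; apart from `tickSearch`, `initialSearchBucket` and the promoted `discover` enum, every name is a placeholder.

```cpp
// Schematic model of the change described above; not the real pvxs internals.
#include <cstdio>
#include <string>
#include <vector>

// The former `discover` bool, promoted to an enum covering the three cases.
enum class SearchKind {
    Discover,   // periodic server discovery ping
    Initial,    // first search for channels created since the last tick
    Retry,      // re-send for channels that have not answered yet
};

struct ModelContext {
    std::vector<std::string> initialSearchBucket; // channels awaiting their first search
    std::vector<std::string> retryBucket;         // stand-in for the retry "search buckets" wheel

    void addChannel(const std::string& name) {
        // Creating a channel only queues it; it no longer triggers a
        // re-broadcast of every outstanding search.
        initialSearchBucket.push_back(name);
    }

    void tickSearch(SearchKind kind) {
        switch(kind) {
        case SearchKind::Initial:
            // One batched request for everything queued since the last tick,
            // after which the channels move to the retry wheel.
            std::printf("initial search for %zu channel(s)\n", initialSearchBucket.size());
            retryBucket.insert(retryBucket.end(),
                               initialSearchBucket.begin(), initialSearchBucket.end());
            initialSearchBucket.clear();
            break;
        case SearchKind::Retry:
            std::printf("retry search for %zu unanswered channel(s)\n", retryBucket.size());
            break;
        case SearchKind::Discover:
            std::printf("discovery ping\n");
            break;
        }
    }
};

int main() {
    ModelContext ctxt;
    for(int i = 0; i < 5; i++)
        ctxt.addChannel("TEST:CALC0000" + std::to_string(i));
    ctxt.tickSearch(SearchKind::Initial); // one batched request instead of five
    return 0;
}
```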
… queue

By using `tcp_loop.dispatch` to schedule the initial search for a channel, we are placing the callback into the same work queue that is used by e.g. `MonitorBuilder::exec` to schedule the call to `Channel::build`. In situations where lots of channels are being created simultaneously, this can result in lots of single-channel search requests being sent, because the work queue alternates between calls to build a channel and the initial search.

In this commit we instead use a dedicated `evevent` to schedule the initial search, which allows the `initialSearchBucket` to be filled before we send the initial search request. We delay the initial search by 10 ms to give more time for the bucket to be filled. See https://github.com/epics-base/pvxs/pull/39 for a discussion of how this delay was chosen.
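The scheduling idea can be illustrated with a minimal, self-contained toy written directly against libevent (which pvxs wraps). This is only a sketch of the coalescing pattern, not the PR's actual `evevent` usage; the Batcher type and its members are invented for illustration.

```cpp
// Toy illustration: coalesce channel creations behind a one-shot 10 ms timer,
// built directly on libevent. Not the PR's actual code, just the idea.
#include <event2/event.h>
#include <cstdio>
#include <string>
#include <vector>

struct Batcher {
    event_base* base;
    event* timer = nullptr;
    std::vector<std::string> pending;   // stands in for initialSearchBucket

    explicit Batcher(event_base* b) : base(b) {
        timer = evtimer_new(base, &Batcher::fire, this);
    }
    ~Batcher() { event_free(timer); }

    void addChannel(const std::string& name) {
        pending.push_back(name);
        if(!evtimer_pending(timer, nullptr)) {
            // Arm the one-shot timer only if it isn't already pending, so many
            // creations within the window share a single "initial search".
            timeval delay{0, 10000}; // 10 ms
            evtimer_add(timer, &delay);
        }
    }

    static void fire(evutil_socket_t, short, void* raw) {
        auto self = static_cast<Batcher*>(raw);
        std::printf("initial search for %zu channel(s)\n", self->pending.size());
        self->pending.clear();
    }
};

int main() {
    event_base* base = event_base_new();
    {
        Batcher batcher(base);
        for(int i = 0; i < 100; i++)
            batcher.addChannel("TEST:CALC" + std::to_string(i));
        event_base_dispatch(base);  // runs until the timer fires once
    }
    event_base_free(base);
    return 0;
}
```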
Sorry I accidentally closed the PR when I tried to rebase using the github interface. I did not realise the button I pressed would do that. I have added a 10 ms delay to the initial search as you suggested.
Indeed, this is a hard problem to implement correctly.
See epics-base/epics-base#372 for the search performance issue seen on CA that @mdavidsaver mentioned above.
✅ Build pvxs 1.0.866 completed (commit fa6b52c1f7 by @thomasives)
It looks like you were able to straighten this out. Although for some reason I had to re-enable the GHA runs. Last I checked I only had to do this once per account, but then github.com is a moving target. Anyway, this change looks ok to me.
✅ Build pvxs 1.0.867 completed (commit cf47f82b21 by @thomasives)
I have been looking at the test failures on OS X and I am struggling to understand how my changes could have caused this. Unfortunately, I don't have access to an OS X machine, so it is hard for me to investigate. The test that is failing is

Regardless of why this is happening, I don't see how my changes could cause it, as this is testing the evhelper classes, which are at a lower level than my changes. @mdavidsaver any ideas?
The issue with
In
Fixing

I've been working with the changes in this PR, which so far work as expected. I found one mistake I made before, which this PR perpetuates, where a discovery
Merged as of fe5a35e. Thanks!
When creating a large number of channels (~1000) in a short amount of time, pvxs can end up sending several search requests for each channel without giving the IOC a reasonable amount of time to respond. This results in search requests going missing, delaying how long it takes for all the channels to be found. The issue is that when we send an initial search for a channel we also send search requests for channels we are still waiting on. This PR fixes this by separating the sending of initial search requests from the re-sending of search requests that have not yet been answered. More details in the commit messages.
To demonstrate the issue I am using an example program which creates 1000 monitors to PVs provided by the example_ioc from ajgdls/EPICSPyClientPerformance.git. When investigating the performance on my machine using cashark [1], I get the following when using the master branch:

We are, on average, sending ~12 search requests for each PV, and it takes ~8 s to finish finding them all. With this PR I get the following in the same situation:
With the PR, we only send a single search request for each PV, and we have reduced the time it takes to finish finding all the PVs by a factor of ~250 compared to the master branch. On my system these pcaps are representative of what happens for each version.
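For context, here is a rough stand-in for the kind of test client described above. The actual program is the EPICSPyClientPerformance-based setup mentioned in the description; the pvxs client calls below are assumptions based on the documented builder API, not the author's code.

```cpp
// Hedged sketch of a client creating ~1000 monitors at once, so the batching of
// initial search requests can be observed in wireshark/cashark.
#include <chrono>
#include <cstdio>
#include <memory>
#include <thread>
#include <vector>

#include <pvxs/client.h>

int main() {
    auto ctxt = pvxs::client::Context::fromEnv();

    std::vector<std::shared_ptr<pvxs::client::Subscription>> subs;
    subs.reserve(1000);

    char name[32];
    for(unsigned i = 0; i < 1000; i++) {
        std::snprintf(name, sizeof(name), "TEST:CALC%05u", i);
        // Each exec() queues a channel creation; with this PR the initial
        // searches for all of these should be packed into a few UDP frames.
        subs.push_back(ctxt.monitor(name)
                           .event([](pvxs::client::Subscription&) {
                               // updates would be pop()'d here in a real client
                           })
                           .exec());
    }

    // keep the process alive long enough to observe the searches on the wire
    std::this_thread::sleep_for(std::chrono::seconds(30));
    return 0;
}
```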
@mdavidsaver Please let me know if you can think of a better way of fixing this. I'm happy to iterate on this if needs be.
Ping @ajgdls @coretl.
[1] With a small patch to add a `pva.count` field.