Lock contention in OutbackCDX PoolingHttpClientConnectionManager #57
I set up a load test class and ran it against OutbackCDX on my laptop, and found scaling issues on both sides. Firstly, OutbackCDX needs quite a lot of additional threads when heavily loaded (presumably it takes a while to drop them when requests finish): with 1000 clients, 2000 OutbackCDX threads were not enough to prevent requests being dropped. Secondly, with the client pool at 1000 threads the clients start failing occasionally, and giving OutbackCDX even more threads does not help. Some connections hang for a long time (>60s!), and whatever timeout is used there are still failures.
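The actual load test class isn't shown in this thread, but the shape of it can be sketched. The following is a minimal, self-contained harness (names and numbers are illustrative, not the real test): a JDK stub server stands in for OutbackCDX, and a fixed pool of workers fires requests at it while counting failures.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal load-test harness sketch: N concurrent workers hammer a stub
// CDX-style endpoint and count failures. The stub stands in for OutbackCDX.
public class CdxLoadTest {
    public static void main(String[] args) throws Exception {
        HttpServer stub = HttpServer.create(new InetSocketAddress(0), 0);
        stub.createContext("/", ex -> {
            byte[] body = "org,example)/ 20200101000000 -".getBytes();
            ex.sendResponseHeaders(200, body.length);
            ex.getResponseBody().write(body);
            ex.close();
        });
        stub.setExecutor(Executors.newFixedThreadPool(8));
        stub.start();

        int workers = 16, requests = 200; // scale these up to provoke contention
        URI uri = URI.create("http://127.0.0.1:" + stub.getAddress().getPort() + "/q");
        HttpClient client = HttpClient.newHttpClient(); // thread-safe, shared
        AtomicInteger errors = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                try {
                    HttpResponse<String> r = client.send(
                            HttpRequest.newBuilder(uri).build(),
                            HttpResponse.BodyHandlers.ofString());
                    if (r.statusCode() != 200) errors.incrementAndGet();
                } catch (Exception e) {
                    errors.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        stub.stop(0);
        System.out.println("requests=" + requests + " errors=" + errors.get());
    }
}
```

At small scale this prints zero errors; the failures described above only appear once the worker and connection counts get large.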
The first thing to try is to tune up OutbackCDX so there's definitely plenty of headroom. Grepping the logs shows lines like:
So it is running out, and the limit should be raised. Then we can look at improving the client behaviour when the thread pool is large.
Hmmm, switching to a ThreadLocal HTTP client setup made no difference. And ulimits don't seem to be the problem.
Hmm, further experimentation, including running this against a faux OutbackCDX server (actually an NGINX instance that always returns the same response), indicates that this appears to be a problem with OutbackCDX itself. Momentarily switching to a more recent version of Apache HttpClient gives somewhat more detailed errors. From the NGINX FauxCDX:
Running against OutbackCDX:
Hm, so the code was not well written to cope with whatever the underlying failure mode was, and that contributed to the mess. Having tidied up the code and made it retry more sensibly, behaviour is more consistent and the transient failures are overcome. Re-running after this tweak, it completed with no errors!
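The actual retry change isn't shown here, but "retry more sensibly" usually means bounded attempts with backoff, re-throwing only once retries are exhausted. A minimal sketch of that pattern (the flaky operation is simulated — it fails twice, then succeeds — and the method names are illustrative, not the real code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of bounded retries with linear backoff around a transiently
// failing lookup. The "lookup" here is a stand-in that fails twice.
public class RetrySketch {
    static final AtomicInteger calls = new AtomicInteger();

    static String flakyLookup() {
        if (calls.incrementAndGet() < 3) throw new RuntimeException("transient failure");
        return "200 OK";
    }

    static String withRetries(int maxAttempts, long backoffMs) throws Exception {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return flakyLookup();
            } catch (RuntimeException e) {
                last = e;                      // remember the failure
                Thread.sleep(backoffMs * attempt); // linear backoff before retrying
            }
        }
        throw last; // only give up once all attempts are exhausted
    }

    public static void main(String[] args) throws Exception {
        String result = withRetries(5, 10);
        System.out.println("result=" + result + " attempts=" + calls.get());
    }
}
```

This prints `result=200 OK attempts=3`: the first two transient failures are absorbed rather than surfacing as hard errors.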
Note that this is much faster, but largely because I'm running OutbackCDX outside a container (which is interesting in itself). Running against the containerised version:
So, the kernel change I made was:
Since doing that, it all seems a bit happier. Setting it back to 15000 to see if that reverts the behaviour. Hmmm.
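The exact kernel parameter changed above isn't quoted in this thread, so the following is an assumption: on macOS the usual suspects for this kind of socket exhaustion are the ephemeral port range and the TCP MSL (which drives how long sockets sit in TIME_WAIT). These commands just inspect the current values; the commented line shows the non-persistent way to change one.

```shell
# Inspect the knobs most likely involved (macOS names; assumption, not
# necessarily the exact sysctl the author changed):
sysctl net.inet.ip.portrange.first net.inet.ip.portrange.last  # ephemeral port range
sysctl net.inet.tcp.msl                                        # drives TIME_WAIT duration
# Example change (takes effect immediately, not persistent across reboot):
# sudo sysctl -w net.inet.ip.portrange.first=32768
```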
Ran two more times and consistently got one failure. Okay, trying a much smaller value (1000) and more runs at that value... Yep, consistently fine, though still getting some socket errors.
Okay, so example results running natively with a low socket release timeout (1000):
Same running under Docker...
Again, a lot of socket errors. It's possible that going via the Docker network means many more ephemeral ports get tied up. What about switching to…
So, on Mac, there's a weird Docker overhead and a weird slowdown at 1000 threads. That aside, as per nla/outbackcdx#389, using OutbackCDX's Undertow server mode (rather than NanoHTTPD) seems to be stable and slightly faster.
So, due to problems with NanoHTTPD and the JRE itself (which started behaving a bit better following an upgrade!), I think I lost track of the original problem, and would like to try switching back to thread-local BasicHttpClientConnectionManager usage. The Heritrix ToeThreads use this approach, so I'd like to try that rather than use the pool. Right now running the tests is incredibly slow, so something weird is going on.
Okay, so some macOS issues aside, it seems to work fine now. Changed the code to make pooling OutbackCDX connections optional. Speed is roughly the same under simple load testing, so this shouldn't make things worse, and if there is a thread contention issue in the pool it may improve things somewhat.
Hm, using a ThreadLocal leads to multiple thread errors I don't fully understand:
Hm, okay, so the client was being held at class scope, and re-using it across threads probably caused the state problems. Making the client unique per request works, but the load test then goes badly because we run out of ephemeral ports (no connections are being re-used). So, perhaps we can make the HttpClient (which contains the connection manager) thread-local, and get it working...
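The thread-local pattern being described can be sketched with the JDK's own `java.net.http.HttpClient` purely for illustration (the real code wraps Apache's client): each worker thread lazily gets its own client instance and keeps re-using it, so connections are re-used within a thread but nothing is shared across threads.

```java
import java.net.http.HttpClient;
import java.util.Set;
import java.util.concurrent.*;

// Sketch of a thread-local HTTP client: one client per worker thread,
// created lazily, re-used on every call from that thread.
public class ThreadLocalClientSketch {
    static final ThreadLocal<HttpClient> CLIENT =
            ThreadLocal.withInitial(HttpClient::newHttpClient);

    public static void main(String[] args) throws Exception {
        int threads = 4;
        Set<HttpClient> distinct = ConcurrentHashMap.newKeySet();
        CyclicBarrier barrier = new CyclicBarrier(threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                barrier.await();            // force all four threads to take part
                distinct.add(CLIENT.get()); // same instance every call in this thread
                distinct.add(CLIENT.get());
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        // One client per thread: re-used within a thread, never shared across.
        System.out.println("threads=" + threads + " distinct clients=" + distinct.size());
    }
}
```

This prints `threads=4 distinct clients=4`: per-request creation would have produced 8 instances, and a single shared field would have produced 1 (the class-scope mistake described above).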
Yep, that works.
Running at large scale, many threads appear to be in a locked/waiting state, thrashing a lock in the `PoolingHttpClientConnectionManager` used by the `OutbackCDXClient`. We are using a pool for multithreaded execution, which appears to have known contention issues, at least in some cases. Note also, from the javadocs:

The documentation implies that the stale check could be disabled to increase performance (see `CoreConnectionPNames.STALE_CONNECTION_CHECK`).

Given we're generally managing pools of threads anyway, particularly the `ToeThread`s, it may make more sense to use a `ThreadLocal` HTTP client rather than a shared pool. This would ensure no pool contention.

We likely need a performance test case that runs a lot of `OutbackCDXClient`s and can detect the pool contention problem.