Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WFSSL-112 Use PooledByteBuffer to fix performance issue caused by ByteBuffer allocation #128

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

heyuanliu-intel
Copy link

@heyuanliu-intel heyuanliu-intel commented Jan 14, 2023

During using Facebook Airlift sample application to do benchmarking, the CPU user utilization is very low (about 9%) but CPU sys utilization is very high (about 66%) if we use wildfly-openssl as openssl provider and set the custom engine to QAT. From the Java Thread Dump File, we can found the performance bottleneck is the ByteBuffer allocation. So use pooled byte buffer to replace the direct byte buffer allocation to fix the performance bottleneck.

@heyuanliu-intel
Copy link
Author

heyuanliu-intel commented Jan 14, 2023

Benchmark application is Facebook Airlift sample application and benchmark client is wrk.
Below is the performance results for this code change. From the results, pooled byte buffer can boost performance about 5.5X.

ByteBuffer CPU User Utilization CPU Sys Utilization Performance Result (Requests/sec)
Direct Allocation 8.8% 66.1% 85991.90
Pooled ByteBuffer 60.1% 12.2% 434651.57

Below is the benchmark result for direct byte buffer allocation:

top - 21:43:55 up 129 days, 21:07,  8 users,  load average: 38.51, 67.03, 54.50
Tasks: 1417 total,   1 running, 1415 sleeping,   0 stopped,   1 zombie
%Cpu(s):  8.8 us, 66.1 sy,  0.0 ni, 24.1 id,  0.0 wa,  0.3 hi,  0.7 si,  0.0 st
MiB Mem : 386115.5 total, 218927.6 free,  64021.4 used, 103166.5 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used. 319320.7 avail Mem

[root@icx02 AirliftPerformance]# ./run_wrk.sh 
run long connection testing
Running 2m test @ https://localhost:9300/v1/service
  128 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.83ms   27.51ms   1.99s    99.89%
    Req/Sec     3.41k     2.10k    9.41k    55.67%
  10327570 requests in 2.00m, 1.51GB read
  Socket errors: connect 0, read 0, write 0, timeout 184
Requests/sec:  85991.90
Transfer/sec:     12.88MB

Below is the benchmark result for pooled byte buffer:

top - 21:54:28 up 129 days, 21:17,  7 users,  load average: 62.56, 66.70, 63.09
Tasks: 1414 total,   1 running, 1412 sleeping,   0 stopped,   1 zombie
%Cpu(s): 60.1 us, 12.2 sy,  0.0 ni, 23.2 id,  0.0 wa,  0.6 hi,  3.8 si,  0.0 st
MiB Mem : 386115.5 total, 217239.7 free,  65703.4 used, 103172.4 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used. 317638.8 avail Mem

[root@icx02 AirliftPerformance]# ./run_wrk.sh 
run long connection testing
Running 2m test @ https://localhost:9300/v1/service
  128 threads and 2000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    75.12ms  104.99ms 971.58ms   80.69%
    Req/Sec     3.45k     2.13k   26.53k    67.90%
  52201515 requests in 2.00m, 7.63GB read
Requests/sec: 434651.57
Transfer/sec:     65.08MB

@fjuma
Copy link
Contributor

fjuma commented Jan 17, 2023

@stuartwdouglas Just wanted to get your thoughts on this one. I see something similar was previously discussed in #120 but the changes that were discussed there are no longer available.

Copy link
Contributor

@stuartwdouglas stuartwdouglas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is not something we want out of the box, it needs to be pluggable.

The issues are:

  • Wildfly has its own pool, we want to use this one and not create a new one.
  • The ThreadLocal cache in the pool can cause memory leaks, if the engine is used by transient threads.

@fjuma
Copy link
Contributor

fjuma commented Jan 23, 2023

@stuartwdouglas Thanks for taking a look at this! To make this pluggable, one option might be to introduce a system property that could be used to specify that the pooled byte buffer should be used instead. The new system property could default to false so the behaviour is unchanged by default. What do you think?

Just to provide some more context, @heyuanliu-intel has been working on adding the ability to support a custom engine to WildFly OpenSSL. With his changes, it will be possible to specify a system property that indicates the engine that should be used. @heyuanliu-intel has been making use of the Intel QAT engine and has been testing performance with this engine outside of WildFly.

@heyuanliu-intel
Copy link
Author

@stuartwdouglas @fjuma
I think this is a performance issue that should be fixed. I have used a simple application that do a pressure test with wildfly-openssl without custom engine. It has the same performance issue. So if make it pluggable, when should it be turned on and off ? If the user turns it off, it will have performance issue under the pressure of the application, I think this is unacceptable for the user.

@stuartwdouglas
Copy link
Contributor

I think on by default is ok, but it should provide a simple way to both disable it, and to allow a custom pool implementation.

I think initially just a system property option to disable it is probably ok.

@heyuanliu-intel
Copy link
Author

The issues are:

  • Wildfly has its own pool, we want to use this one and not create a new one.
  • The ThreadLocal cache in the pool can cause memory leaks, if the engine is used by transient threads.

In the previous version, I have used an ByteBuffer pool created by myself.
Now, I have used the wildfly's own pool (org.wildfly.openssl.DefaultByteBufferPool) to avoid the issues listed here.

@heyuanliu-intel
Copy link
Author

heyuanliu-intel commented Jan 24, 2023

and to allow a custom pool implementation.

I think if wildfly-openssl can be managed by lightweight dependency injection framework such as Google Guice, then it is easy to allow a custom pool implementation. But it will be huge change for wildfly-openssl. So there is no easy way for the user to provide a custom pool implementation now and this shouldn't be considered now.

And I found org.wildfly.openssl.OpenSSLSocket has used this byte buffer pool also. Should that make the pool pluggable also to keep the configuration consistency?

Actually I think for the pool solution, out of box is better than make it pluggable now because I don't know which situation should make it turn off.

@fjuma
Copy link
Contributor

fjuma commented Jan 26, 2023

@stuartwdouglas Thanks for the feedback!

@heyuanliu-intel For now, I think we could introduce a system property that could be used to disable the use of the pooled byte buffer in OpenSSLEngine so that if someone wants to go back to the original behaviour they could do so.

In the future, we could add the ability to specify a custom ByteBuffer implementation to use.

@heyuanliu-intel
Copy link
Author

I will reserve my personal opinion. I have modified the code according to comments.

@fjuma
Copy link
Contributor

fjuma commented Jan 30, 2023

I will reserve my personal opinion. I have modified the code according to comments.

@heyuanliu-intel Please feel free to share any thoughts, concerns, etc.

@heyuanliu-intel
Copy link
Author

heyuanliu-intel commented Jan 31, 2023

In my opinion: this is a bug that should be fixed instead of letting user to turn it off.
If the user turns it off, it will have huge performance issue. This isn't a program that should let user choose.

@stuartwdouglas
Copy link
Contributor

If it is turned on it can cause an unfixable memory leak for some situations. Basically any situation where you have threads that use the cache that don't last for the full life of the application the 'ThreadLocal' in the cache will continue to hold onto the direct buffers even after the thread is gone.

If you are using a thread pool this is less of an issue, as long as the thread pool does not dynamically resize. Even then you can still get problems (e.g. if a thread pool is recreated on redeployment you end up with memory leaks on redeploy).

This is not just a theoretical problem, we have seen it before in WildFly.

@heyuanliu-intel
Copy link
Author

heyuanliu-intel commented Jan 31, 2023

I think the main reason that causes the memory leak is caused by "ThreadLocal", it is discussed on stackoverflow and the link is [ThreadLocal & Memory Leak]. The final conclusion is that there is no good way to avoid memory leak by using "ThreadLocal" in containerized environment.

My question is : How about using a global synchronized Map to keep the instances instead of using ThreadLocal? Although it will loss of the performance compared with non-synchronized ThreadLocal but it is safe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants