io_uring new uSockets backend #1603

Open
uNetworkingAB opened this issue May 6, 2023 · 14 comments

@uNetworkingAB
Contributor

As of Linux 5.19 I now see io_uring slightly outperforming epoll in an actual networked ping/pong benchmark involving client and server TCP processes. The gain is less than 10% on my computers, often less than 5%, so there are no gigantic gains, since we come from highly optimized epoll usage. But there are gains, especially on mitigated systems.

The long-term plan is to add a new backend in uSockets. It's very, very low priority, but I'm interested long term.

@uNetworkingAB
Contributor Author

Scratch that - as of Linux 6.0 there are significant gains with io_uring. I can see 18% gains in testing when using the 6.0 features. They make a lot of sense now even for low-memory cases, and many restrictions have been removed.

This is going to be a new backend in uSockets, probably the default one as soon as possible. I like it and it will be nice to have a path that is entirely optimized solely for Linux.

@uNetworkingAB
Contributor Author

Scratch that, the gains I'm seeing are more like 21%.

@uNetworkingAB
Contributor Author

Linux 6.0+ is crazy fast. I have a build of uWS that does HTTP messaging with URL router over io_uring, faster than raw TCP messaging over epoll.

This perf. is going to compound for pub/sub use cases and there's no doubt that io_uring will be default in v21. This is a game changer.

My shitty 10 year old semi-budget PC can do 406k messages per second, on 1 CPU core. This was 325k with epoll. And it does 336k URL routed HTTP messages per second. That's slap in the face kind of performance.

@dalisoft

@uNetworkingAB Will there be an alpha release of uWebSockets.js for this implementation?

@uNetworkingAB
Contributor Author

I was still experimenting with a POC 5 days ago, and got this working in uSockets for the first time yesterday. So we are many months away from even beginning to think about what to do about Node.js integration.

But there has always been a difference between what is used in uWebSockets.js and what is used in uWebSockets: one uses libuv and the other uses raw epoll. So it will most likely just change to libuv for one and io_uring for the other.

@dalisoft

So this implementation is only for C++?

@angelsanzn

For the sake of completeness: apparently libuv is working on its own implementation of io_uring as a "backend". First proposed in 2018 (libuv/libuv#1947) but movement started very recently (libuv/libuv#3952, libuv/libuv#3979).

@uNetworkingAB
Contributor Author

Oh wow, I'm up almost 100k messages per second:

Messages per second: 424716.812500
Messages per second: 425812.031250
Messages per second: 423331.312500
Messages per second: 424703.812500

This used to be 325k or something
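
For reference, the four samples above can be averaged and compared against the earlier epoll figure. This is a minimal sketch in C++; the helper names `mean` and `relative_gain_percent` are illustrative, and the 325k baseline is the approximate figure quoted above, so the result is only as precise as that estimate.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a set of throughput samples (messages per second).
double mean(const std::vector<double>& samples) {
    assert(!samples.empty());
    return std::accumulate(samples.begin(), samples.end(), 0.0) /
           static_cast<double>(samples.size());
}

// Relative gain of `current` over `baseline`, in percent.
double relative_gain_percent(double baseline, double current) {
    assert(baseline > 0.0);
    return (current - baseline) / baseline * 100.0;
}

// The four io_uring samples posted in this comment.
const std::vector<double> io_uring_samples = {
    424716.8125, 425812.03125, 423331.3125, 424703.8125};

// Approximate epoll figure quoted in the thread ("325k or something").
const double epoll_baseline = 325000.0;
```

With these inputs, `mean(io_uring_samples)` comes to about 424.6k messages per second, and `relative_gain_percent(epoll_baseline, mean(io_uring_samples))` is roughly 30.7%.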

@uNetworkingAB
Contributor Author

Holy cock!

Req/sec: 469954.250000
Req/sec: 467404.500000
Req/sec: 468516.500000

The latest io_uring Linux "for-next" branch is speedy as all heck

@kolinfluence

kolinfluence commented Jun 10, 2023

@uNetworkingAB
just curious, do you think AF_XDP will be faster?

your findings remind me of this benchmark...
https://github.com/pawelgaczynski/gain

@uNetworkingAB
Contributor Author

TLDR - AF_XDP is kernel bypassing, we are not doing that here. The target platform for uWS is "vanilla Linux", not bypasses.

It's also hard to tell if that's actually better than Linux networking given that it depends on what drivers your particular NIC has, such as whether it does full bypass on its own or does TLS offloading, etc.

Bypasses drastically reduce the ease of use / applicability to real world deployments in real companies.

@hiqsociety

@uNetworkingAB
maybe someday you can do a multicore speed comparison with https://github.com/bytedance/monoio for HTTP.
curious how much perf difference there is vs a Rust-based program

@uNetworkingAB
Contributor Author

I've done some thinking and investigation.

The reason epoll is faster than io_uring for sending 8 KB and 16 KB messages is that io_uring uses much more memory: epoll uses one single receive buffer and one single send buffer, while io_uring uses per-socket receive and send buffers.

With more memory usage, the working set no longer fits in CPU cache, whereas epoll's does.
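
The single-receive-buffer pattern described above can be sketched roughly as follows. This is not the uSockets implementation; `poll_once`, `shared_recv_buffer`, `kMaxEvents`, and the 16 KB size are assumptions chosen for illustration. It uses only the standard Linux `epoll_wait` and `recv` calls, and shows why the hot path touches the same memory no matter which socket is ready.

```cpp
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#include <array>
#include <cstddef>
#include <string>
#include <vector>

// One receive buffer reused for every ready socket (hypothetical 16 KB size).
static std::array<char, 16 * 1024> shared_recv_buffer;

// Maximum number of events fetched per epoll_wait call (illustrative value).
constexpr int kMaxEvents = 64;

// Waits up to timeout_ms for readable sockets registered with epfd, reads
// each one once into the shared buffer, and returns copies of the payloads
// in the order epoll reported them.
std::vector<std::string> poll_once(int epfd, int timeout_ms) {
    std::vector<std::string> messages;
    epoll_event events[kMaxEvents];
    int ready = epoll_wait(epfd, events, kMaxEvents, timeout_ms);
    for (int i = 0; i < ready; ++i) {
        // Every socket reads into the same memory, so the buffer stays hot
        // in cache regardless of how many connections exist.
        ssize_t n = recv(events[i].data.fd, shared_recv_buffer.data(),
                         shared_recv_buffer.size(), 0);
        if (n > 0) {
            messages.emplace_back(shared_recv_buffer.data(),
                                  static_cast<std::size_t>(n));
        }
    }
    return messages;
}
```

The caller is expected to have registered each socket with `epoll_ctl` and stored the file descriptor in `data.fd`. By contrast, the comment above describes the io_uring backend as keeping receive and send buffers per socket, so the memory it touches grows with the number of connections.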

If I run 2 processes with epoll, sending TCP messages back and forth at 300k msg/sec on localhost, the number of cache misses is only a fraction of that. This means that the entire hot path of epoll benchmarking on localhost never even writes to RAM - it's entirely to/from CPU cache.

144 million TCP messages sent and received produce only 2 million cache misses. So the entire "message sending" between the two involved processes never even leaves the CPU cache. It's a total short circuit.
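
To put those two figures in proportion, the miss rate per message can be worked out directly. A tiny C++ helper, with an illustrative function name:

```cpp
#include <cassert>

// Average number of cache misses per message over a benchmark run.
double misses_per_message(double cache_misses, double messages) {
    assert(messages > 0.0);
    return cache_misses / messages;
}
```

`misses_per_message(2e6, 144e6)` is about 0.014, i.e. roughly one cache miss per 72 messages.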

So benchmarking on localhost with epoll / io_uring is very misleading unless you can find a good balance between excessive io_uring calls and memory usage.

Most likely, io_uring requires a CPU with a lot of cache - mine only has 6MB while modern CPUs have 96MB. With more cache, it will be easier to efficiently batch io_uring calls in order to get better performance.

I think that io_uring will never perform as well as epoll on old CPUs for this reason: there is not enough CPU cache to queue up enough calls to be faster than the equivalent number of send syscalls.
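
A rough way to reason about this is to compare the buffer memory each model touches against the cache size. The sketch below is a back-of-envelope model, not a measurement: the 16 KB buffer size, the connection counts, and the function names are assumptions, and only the 6 MB and 96 MB cache sizes come from the comment above. It also ignores kernel socket buffers and everything else competing for cache.

```cpp
#include <cassert>
#include <cstdint>

const std::uint64_t KiB = 1024;
const std::uint64_t MiB = 1024 * KiB;

// Shared-buffer model: one receive and one send buffer in total.
std::uint64_t shared_buffer_bytes(std::uint64_t buffer_size) {
    return 2 * buffer_size;
}

// Per-socket model: one receive and one send buffer per connection.
std::uint64_t per_socket_buffer_bytes(std::uint64_t buffer_size,
                                      std::uint64_t sockets) {
    return 2 * buffer_size * sockets;
}

// Largest number of sockets whose per-socket buffers fit in cache_bytes.
std::uint64_t max_sockets_in_cache(std::uint64_t cache_bytes,
                                   std::uint64_t buffer_size) {
    assert(buffer_size > 0);
    return cache_bytes / (2 * buffer_size);
}
```

Under these assumptions, `max_sockets_in_cache(6 * MiB, 16 * KiB)` is 192 and `max_sockets_in_cache(96 * MiB, 16 * KiB)` is 3072, while `shared_buffer_bytes(16 * KiB)` stays at 32 KB regardless of the connection count.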

I need a better setup (hardware) to get an apples-to-apples comparison between epoll and io_uring, probably a 10 Gbit Ethernet card with 2 connectors and a wire connecting the two. That way you guarantee both solutions must go to the NIC.

@billywhizz

> I need a better setup (hardware) to get an apples-to-apples comparison between epoll and io_uring, probably a 10 Gbit Ethernet card with 2 connectors and a wire connecting the two. That way you guarantee both solutions must go to the NIC.

your analysis makes sense. i have never been able to make io_uring out-perform epoll in local benches and had assumed it was just a skill issue, but maybe not.

re. the 10Gb - TechEmpower have recently upgraded their hardware to 10Gb between the DB/load generator/Web Server, so it might be worth putting it in there with io_uring enabled and disabled to see if it makes a difference. I dunno if it's enabled by default in the existing node.js/uWebSockets and Bun implementations on TE.
