Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential bug in GRO code for Windows? #2041

Open
thomaseizinger opened this issue Nov 14, 2024 · 42 comments
Open

Potential bug in GRO code for Windows? #2041

thomaseizinger opened this issue Nov 14, 2024 · 42 comments

Comments

@thomaseizinger
Copy link
Contributor

We are seeing an issue on a Windows machine where quinn-udp gives us packets larger than the configured MTU (2540 and 3810). Here is how we split the packets: https://github.com/firezone/firezone/blob/5437c3e2df001c8c491749c7bff0c063ccdf39e1/rust/socket-factory/src/lib.rs#L210-L255

This code has been working on Linux for several months and we have also not had reports that it doesn't work on Windows. Recently however a customer filed a bug report and looking through the logs, I discovered that we receive packets (i.e. DatagramIn in the above snippet) that are much larger than expected.

Going through quinn-udp's source code, I developed two theories and would be interested in hearing your thoughts on it:

  1. stride is being read incorrectly (endianess-issue?) and instead of being smaller than len, it is bigger than len, meaning we never end up splitting the packets despite it being coalesced.
    stride = unsafe { cmsg::decode::<u32, WinSock::CMSGHDR>(cmsg) };
  2. The buffer for control messages is too small, thus Windows fails to fit the UDP_COALESCED_INFO message into it and can't tell us that the buffer is coalesced even though it is.

Unfortunately, I've not been able to reproduce this on my own hardware :(

@stormshield-damiend
Copy link
Contributor

Does the windows run on intel CPU or on ARM CPU ?

@mxinden
Copy link
Collaborator

mxinden commented Nov 14, 2024

We see a similar issue on Windows on ARM, see summary in last comment on this thread.

While I myself don't have access to a Windows on ARM machine, someone at Mozilla does and is able to reproduce the issue. We have not been able to reproduce this on a Windows on ARM VM.

The buffer for control messages is too small, thus Windows fails to fit the UDP_COALESCED_INFO message into it and can't tell us that the buffer is coalesced even though it is.

Interesting. I will try reproducing the issue with a larger buffer. For what it is worth, here is MSQUIC's buffer length.

@thomaseizinger
Copy link
Contributor Author

Does the windows run on intel CPU or on ARM CPU ?

We see this issue on an x64_86 CPU.

@thomaseizinger
Copy link
Contributor Author

The buffer for control messages is too small, thus Windows fails to fit the UDP_COALESCED_INFO message into it and can't tell us that the buffer is coalesced even though it is.

Interesting. I will try reproducing the issue with a larger buffer. For what it is worth, here is MSQUIC's buffer length.

I thought about this a bit more and I am no longer sure that is a viable possibility unless we receive a lot of other control messages that we don't match on.

@stormshield-damiend
Copy link
Contributor

Could the receiver interface be a dual stack (IPv4 and IPv6) one ? I had no chance to test that case when writing the code, maybe we receive ipv4 + ipv6 control messages ?

@thomaseizinger
Copy link
Contributor Author

We see a similar issue on Windows on ARM, see summary in last comment on this thread.

Interesting, according to https://bugzilla.mozilla.org/show_bug.cgi?id=1916558#c43, the UDP_COALESCED_INFO never gets received. We only get IP_PKTINFO and IP_ECN.

@thomaseizinger
Copy link
Contributor Author

In https://learn.microsoft.com/en-us/windows-hardware/drivers/network/udp-rsc-offload, it is mentioned that this only works on Windows 11, version 24H2 and later. I'll have to investigate what version our customer is using.

@thomaseizinger
Copy link
Contributor Author

thomaseizinger commented Nov 14, 2024

Could the receiver interface be a dual stack (IPv4 and IPv6) one ? I had no chance to test that case when writing the code, maybe we receive ipv4 + ipv6 control messages ?

I can see in our logs that this customer did have dual-stack but I don't think it matters here because the way we do dual-stack is by having two sockets, one for IPv4 and one for IPv6.

@stormshield-damiend
Copy link
Contributor

This code work fine on our side on Windows 10 and 11.

@stormshield-damiend
Copy link
Contributor

I do not see any special case related to this in msquic C code https://github.com/microsoft/msquic/blob/main/src/platform/datapath_winuser.c#L3216

@thomaseizinger
Copy link
Contributor Author

@mxinden Do you know by any chance if the affected user is using a dock? We have reports that this only happens when the user is connected via a Dell dock WD19S. It works fine over WiFi and via a separate Ethernet dongle.

@stormshield-damiend
Copy link
Contributor

@mxinden could your colleague print the stride value (at L210) before the loop ? it is init with the length of the received packet to handle the case were received packet is not coalesced ?

@saschanaz
Copy link

The colleague here 👋

I'm not using any dock but only the builtin WiFi, and the stride value is 1280.

@mxinden
Copy link
Collaborator

mxinden commented Nov 14, 2024

and the stride value is 1280.

Just to avoid any confusion, the stride value is 1280 AND the datagram length value is 1280. This is with the unit test I provided in this commit, see also the logs here. That said, I have the suspicion that the unit test simply doesn't force datagram coalescing in the kernel, see this bugzilla comment.

Looking at pcaps with fosstodon.org that @saschanaz provided, we see >1_500k UDP datagrams over an internet path. >1_500k over an internet path is very unlikely. My suspicion is, that the Windows kernel correctly coalesces the UDP datagrams (< 1_500k) into one >1_500k datagram, but then either does not provide the stride length, or we fail to read the stride length.

@stormshield-damiend
Copy link
Contributor

I think you wanted to link this commit mxinden@7ce09c3

@stormshield-damiend
Copy link
Contributor

We have a difference with msquic, they set the max segment size to "u16:MAX - 8", we set it to "u16:MAX", maybe you can try with this (i doubt it will change anything but who knows) https://github.com/microsoft/msquic/blob/e3ffc71a375d85b1216348b2ffa1e13653213e26/src/platform/datapath_winuser.c#L1496

@mxinden
Copy link
Collaborator

mxinden commented Nov 14, 2024

@saschanaz when you have time, could you (1) apply the following git diff to your local mozilla-central checkout, (2) build in debug mode, and (3) try whether you can reproduce the issues with fosstodon.org. If you can reproduce the issue, can you provide logs and a pcap?

(I wrote the above against 79808865c472858740158ce5cbbf4e0b002cf848, but any commit from today on bookmarks/central should work.)

Also see past Bugzilla comment for log level.

diff --git a/third_party/rust/quinn-udp/src/windows.rs b/third_party/rust/quinn-udp/src/windows.rs
index 886f455117ceb..054e267d97921 100644
--- a/third_party/rust/quinn-udp/src/windows.rs
+++ b/third_party/rust/quinn-udp/src/windows.rs
@@ -114,7 +114,7 @@ impl UdpSocketState {
             // u32 per
             // https://learn.microsoft.com/en-us/windows/win32/winsock/ipproto-udp-socket-options.
             // Choice of 2^16 - 1 inspired by msquic.
-            u16::MAX as u32,
+            u16::MAX as u32 - 8,
         );
 
         let now = Instant::now();
@@ -211,6 +211,7 @@ impl UdpSocketState {
 
         let cmsg_iter = unsafe { cmsg::Iter::new(&wsa_msg) };
         for cmsg in cmsg_iter {
+            error!("cmsg_iter {} {}", cmsg.cmsg_level, cmsg.cmsg_type);
             const UDP_COALESCED_INFO: i32 = WinSock::UDP_COALESCED_INFO as i32;
             // [header (len)][data][padding(len + sizeof(data))] -> [header][data][padding]
             match (cmsg.cmsg_level, cmsg.cmsg_type) {
@@ -251,6 +252,7 @@ impl UdpSocketState {
             ecn: EcnCodepoint::from_bits(ecn_bits as u8),
             dst_ip,
         };
+        error!("{:?}", meta[0]);
         Ok(1)
     }
 
@@ -398,7 +400,7 @@ fn set_socket_option(
 
 pub(crate) const BATCH_SIZE: usize = 1;
 // Enough to store max(IP_PKTINFO + IP_ECN, IPV6_PKTINFO + IPV6_ECN) + max(UDP_SEND_MSG_SIZE, UDP_COALESCED_INFO) bytes (header + data) and some extra margin
-const CMSG_LEN: usize = 128;
+const CMSG_LEN: usize = 512;
 const OPTION_ON: u32 = 1;
 
 // FIXME this could use [`std::sync::OnceLock`] once the MSRV is bumped to 1.70 and upper

@stormshield-damiend the diff above contains your suggestion, namely to use u16::MAX - 8.

@thomaseizinger this includes your suspicion that CMSG_LEN might be too small.

Thank you for the help everyone!

@saschanaz
Copy link

 2:34.94 error: the listed checksum of `D:\gecko\third_party\rust\quinn-udp\src/windows.rs` has changed:
 2:34.95 expected: ab1928d18bed62162a0f2c96158d808d7a2962045ab47c9efa0ecf60e2a2c060
 2:34.95 actual:   3a894014bd45e81f7617efe6e3d1c1c2e07a75da68646e7cc8dc447cc984c522
 2:34.95 directory sources are not intended to be edited, if modifications are required then it is recommended that `[patch]` is used with a forked copy of the source

I guess I need a fork to make that work?

@saschanaz
Copy link

Sent! It still showed a lot of dropped packets with decryption failure though.

@thomaseizinger
Copy link
Contributor Author

More theories:

  • Do we have a bug in cmsg::Iter and fail to decode the last control message? Perhaps printing the raw bytes sitting in the cmesg buffer might be helpful.

  • We have a difference with msquic, they set the max segment size to "u16:MAX - 8", we set it to "u16:MAX"

    The microsoft docs recommend to use WSASetUdpRecvMaxCoalesced instead of the socket option. Maybe that is worth a try?

  • This code work fine on our side on Windows 10 and 11.

    And you see coalesced packets on a regular basis?

@thomaseizinger
Copy link
Contributor Author

I managed to set up a Windows 11 VM that uses USB passthrough to a dedicated network card and I can confirm that GRO works there. Any ideas on what I could try to reproduce the issue are welcome :)

@stormshield-damiend
Copy link
Contributor

More theories:

* Do we have a bug in `cmsg::Iter` and fail to decode the last control message? Perhaps printing the raw bytes sitting in the cmesg buffer might be helpful.

* > We have a difference with msquic, they set the max segment size to "u16:MAX - 8", we set it to "u16:MAX"
  
  
  The microsoft docs recommend to use [`WSASetUdpRecvMaxCoalesced`](https://learn.microsoft.com/en-us/windows/win32/api/ws2tcpip/nf-ws2tcpip-wsasetudprecvmaxcoalescedsize) instead of the socket option. Maybe that is worth a try?

msquic does it with a standard setsockeopt call

* > This code work fine on our side on Windows 10 and 11.
  
  
  And you see coalesced packets on a regular basis?

Yes we do, but we did not tested in Windows ARM only Intel.

@mxinden
Copy link
Collaborator

mxinden commented Nov 15, 2024

Sent! It still showed a lot of dropped packets with decryption failure though.

Thank you @saschanaz for the thorough work!

For others, with the above diff the issue still persists. The pcap shows UDP datagrams with up to 20_000 bytes length over an internet link. As expected, the contained QUIC packet(s) fail to decrypt, given that multiple QUIC packets with short headers are disguised as a single large QUIC packet.

@mxinden
Copy link
Collaborator

mxinden commented Nov 15, 2024

  • This code work fine on our side on Windows 10 and 11.

    And you see coalesced packets on a regular basis?
    

For what it is worth, Firefox Nightly does URO on Windows. While we don't see many coalesced receives, we do see some, e.g. in the 99th percentile 8 or more.

image

https://glam.telemetry.mozilla.org/fog/probe/networking_http_3_udp_datagram_segments_received/explore?aggType=avg&os=Windows&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

@djc
Copy link
Member

djc commented Nov 15, 2024

@nibanks would you happen to have any guidance/hot tips for this stuff?

@nibanks
Copy link

nibanks commented Nov 15, 2024

Is there a specific problem you need help with? I don't have time to go through the whole thread here.

@mxinden
Copy link
Collaborator

mxinden commented Nov 15, 2024

@nibanks summary:

Note that this is the same issue as the one discussed on Slack / e-mail last week, see mail UDP Receive Segment Coalescing Offload (URO) on Windows on ARM.

@saschanaz
Copy link

Per my understanding the URO thing requires 24H2, right? We probably need to make sure that when requesting repro.

@thomaseizinger
Copy link
Contributor Author

@mxinden Have you checked what the return value of

_ = set_socket_option(
&*socket.0,
WinSock::IPPROTO_UDP,
WinSock::UDP_RECV_MAX_COALESCED_SIZE,
// u32 per
// https://learn.microsoft.com/en-us/windows/win32/winsock/ipproto-udp-socket-options.
// Choice of 2^16 - 1 inspired by msquic.
u16::MAX as u32,
);
is on the problematic machine?

Maybe this actually fails but leaves the driver in a kind of limbo state and we should attempt to unset this on failure?

Also, what do people think of adding a safe-guard mechanism like we do for GSO? If the read packet exceeds the interface MTU (I saw y'all have been collaborating on https://github.com/mozilla/mtu), we attempt to unset the socket option.

@Ralith
Copy link
Collaborator

Ralith commented Nov 19, 2024

If the read packet exceeds the interface MTU we attempt to unset the socket option.

It sounds like that would make sense as a last-ditch mitigation, though it'd be nice to understand the root cause better, especially since it sounds like other software might be dodging this issue somehow. Finding the right interface MTU to refer to might also be complex in practice, since it may change over time and/or depending on address.

mxinden added a commit to mxinden/quinn that referenced this issue Nov 23, 2024
mxinden added a commit to mxinden/h3 that referenced this issue Nov 23, 2024
mxinden added a commit to mxinden/h3 that referenced this issue Nov 23, 2024
@mxinden
Copy link
Collaborator

mxinden commented Nov 23, 2024

I can now consistently reproduce the bug. It is tied to WSL. Installing WSL makes it occur. Disabling WSL "fixes" the issue.

Here is a smallish reproducer.

On a Windows on ARM device:

  1. Install Windows subsystem for Linux in a terminal via wsl --install. Note that we are not actually going to use it.
  2. Restart Windows.
  3. Checkout https://github.com/mxinden/h3.git branch reproduce-quinn-2041 e.g. using Powershell, no need to use WSL.
  4. Run cargo run --example client -- https://fosstodon.org.
  5. You will see the following panic Got un-coalesced UDP datagram > 1_500 bytes.

Once reproduced, you can "fix" the bug by disabling WSL:

  1. Control Panel -> Windows features -> turn off "Virtual Machine Platform".
  2. Restart Windows.
  3. Run cargo run --example client -- https://fosstodon.org.
  4. See no panic.

@stormshield-damiend do you have an idea why it might fail with WSL installed, but not without?

Note that I am using https://fosstodon.org here which is using Fastly, which consistently sends packet trains (2 or more UDP datagrams back-to-back), which are then coalesced by the client OS. This is not tied to https://fosstodon.org or Fastly.

@thomaseizinger
Copy link
Contributor Author

Very interesting! Does that also reproduce in a VM? Or only on real hardware?

@mxinden
Copy link
Collaborator

mxinden commented Nov 23, 2024

Does that also reproduce in a VM?

We were not able to reproduce on the following machines, despite WSL being installed:

  • Windows on ARM VM running on an ARM MacBook
  • Windows X86_64 running natively on a x86_64 CPU

@stormshield-damiend
Copy link
Contributor

I have no idea why this fails when WSL is installed

mxinden added a commit to mxinden/quinn that referenced this issue Nov 28, 2024
On Windows on ARM with Windows Subsystem for Linux (WSL) `WSA_RECVMSG` does not
return the segment size of coalesced UDP datagrams. See
quinn-rs#2041 for details.

While lacking a fix for the root cause, don't enable URO, i.e. don't coalesce
UDP datagrams on Windows on ARM.
mxinden added a commit to mxinden/quinn that referenced this issue Nov 28, 2024
On Windows on ARM with Windows Subsystem for Linux (WSL) `WSA_RECVMSG` does not
return the segment size of coalesced UDP datagrams. See
quinn-rs#2041 for details.

While lacking a fix for the root cause, don't enable URO, i.e. don't coalesce
UDP datagrams on Windows on ARM.
@mxinden
Copy link
Collaborator

mxinden commented Nov 28, 2024

For the record, I am proposing to disable URO on Windows on ARM, at least for now, see #2071.

While I can consistently reproduce the bug (see #2041 (comment)), I don't know how to fix the root cause itself.

@larseggert
Copy link
Contributor

@nibanks ^^^

@nibanks
Copy link

nibanks commented Nov 28, 2024

We'll take a look after the Thanksgiving break. Everyone's out right now.

mxinden added a commit to mxinden/quinn that referenced this issue Nov 28, 2024
On Windows on ARM with Windows Subsystem for Linux (WSL) `WSA_RECVMSG` does not
return the segment size of coalesced UDP datagrams. See
quinn-rs#2041 for details.

While lacking a fix for the root cause, don't enable URO, i.e. don't coalesce
UDP datagrams on Windows on ARM.
mxinden added a commit to mxinden/quinn that referenced this issue Nov 28, 2024
On Windows on ARM with Windows Subsystem for Linux (WSL) `WSA_RECVMSG` does not
return the segment size of coalesced UDP datagrams. See
quinn-rs#2041 for details.

While lacking a fix for the root cause, don't enable URO, i.e. don't coalesce
UDP datagrams on Windows on ARM.
github-merge-queue bot pushed a commit that referenced this issue Dec 6, 2024
On Windows on ARM with Windows Subsystem for Linux (WSL) `WSA_RECVMSG` does not
return the segment size of coalesced UDP datagrams. See
#2041 for details.

While lacking a fix for the root cause, don't enable URO, i.e. don't coalesce
UDP datagrams on Windows on ARM.
@thomaseizinger
Copy link
Contributor Author

For the record, I am proposing to disable URO on Windows on ARM, at least for now, see #2071.

While I can consistently reproduce the bug (see #2041 (comment)), I don't know how to fix the root cause itself.

I'd like to propose an additional (automated?) way to disable URO to workaround this. We got another telemetry alert where we received packets on a machine that are bigger than the MTU. I'd like to either have an API where I can disable URO in that case or have quinn-udp disable it for me automatically.

Whilst this check might not detect it in every case as there might be UDP packets read in a batch where they are in total less than the MTU, chances are there will be at least a few packets that maxed out the MTU and are read in a single batch by the driver. In this particular bug report, there were 15 of those.

Thoughts?

mxinden added a commit to mxinden/quinn that referenced this issue Dec 16, 2024
We have multiple bug reports for URO on Windows, including non-ARM. See:

- quinn-rs#2041
- https://bugzilla.mozilla.org/show_bug.cgi?id=1916558

Instead of enabling GRO on Windows by default, this commit changes the default
to off, adding a `set_gro` to optionally enable it.
@mxinden
Copy link
Collaborator

mxinden commented Dec 16, 2024

Thank you for the additional data @thomaseizinger. Very helpful!

I'd like to either have an API where I can disable URO in that case or have quinn-udp disable it for me automatically.

The above heuristic, i.e. to disable URO on >1500 bytes, is in my eyes too fragile to apply across a broad spectrum of use-cases. E.g. in a web-context (latency critical) it might delay connection establishment till the first >1500 bytes packet train is received. Thus I suggest disabling URO by default and adding, as you suggest above, an API to optionally enable it.

Implemented in #2092. What do folks think?

@thomaseizinger
Copy link
Contributor Author

it might delay connection establishment

Slightly OT but why do you think that? Does the initial handshake not fit into a UDP packet? Why would the initial handshake send multiple packets that are not filled to the MTU?

There will still be a delay of course because the coalesced packets needs to be retransmitted. I would expect the actual handshake to succeed though. At least that was what we saw in the logs: Connection established fine and then "breaks" (once the first packet train arrives).

mxinden added a commit to mxinden/quinn that referenced this issue Dec 16, 2024
We have multiple bug reports for URO on Windows, including non-ARM. See:

- quinn-rs#2041
- https://bugzilla.mozilla.org/show_bug.cgi?id=1916558

Instead of enabling GRO on Windows by default, this commit changes the default
to off, adding a `set_gro` to optionally enable it.
mxinden added a commit to mxinden/quinn that referenced this issue Dec 16, 2024
We have multiple bug reports for URO on Windows, including non-ARM. See:

- quinn-rs#2041
- https://bugzilla.mozilla.org/show_bug.cgi?id=1916558

Instead of enabling GRO on Windows by default, this commit changes the default
to off, adding a `set_gro` to optionally enable it.
mxinden added a commit to mxinden/quinn that referenced this issue Dec 16, 2024
We have multiple bug reports for URO on Windows, including non-ARM. See:

- quinn-rs#2041
- https://bugzilla.mozilla.org/show_bug.cgi?id=1916558

Instead of enabling GRO on Windows by default, this commit changes the default
to off, adding a `set_gro` to optionally enable it.
mxinden added a commit to mxinden/quinn that referenced this issue Dec 16, 2024
We have multiple bug reports for URO on Windows, including non-ARM. See:

- quinn-rs#2041
- https://bugzilla.mozilla.org/show_bug.cgi?id=1916558

Instead of enabling GRO on Windows by default, this commit changes the default
to off, adding a `set_gro` to optionally enable it.
mxinden added a commit to mxinden/quinn that referenced this issue Dec 17, 2024
We have multiple bug reports for URO on Windows, including non-ARM. See:

- quinn-rs#2041
- https://bugzilla.mozilla.org/show_bug.cgi?id=1916558

Instead of enabling GRO on Windows by default, this commit changes the default
to off, adding a `set_gro` to optionally enable it.
github-merge-queue bot pushed a commit that referenced this issue Dec 17, 2024
We have multiple bug reports for URO on Windows, including non-ARM. See:

- #2041
- https://bugzilla.mozilla.org/show_bug.cgi?id=1916558

Instead of enabling GRO on Windows by default, this commit changes the default
to off, adding a `set_gro` to optionally enable it.
github-merge-queue bot pushed a commit to firezone/firezone that referenced this issue Dec 19, 2024
This release disables URO/GRO on Windows entirely due to hardware /
driver bugs.

Related: quinn-rs/quinn#2041.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants