-
Notifications
You must be signed in to change notification settings - Fork 7.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Making UDP send latency more predictable (i.e. reducing jitter) (IDFGH-14588) #15345
Comments
FYI: if WPA3-SAE has GTK rekeying like WPA2, we routinely see latency of ~1s during this process. If you're in control of the network, you may want to disable this. Another thing that can go wrong: if a frame is lost, layer2 will often repeat it at a lower speed. If this is happening due to intermittent interference, sometimes layer2 will drop below minimum acceptable bitrate for your application. In this case, it may be better to disable lower radio rates. How are you recording with WireShark? Is it sitting at the receiver at Layer3? Or is it with another machine capturing at layer2 with radiotap headers? If you have the radiotap headers, you should be able to tell what MCS the radio is doing per packet, as well as acknowledgements. |
@bryghtlabs-richard Thank you for the insight but I am afraid I was not able to verify that any of these brings a relief. For one, rekeying takes place every hour or once a day depending on the implementation (no idea how frequently ESP-IDF does that - I assume the code is somewhere in the binary
Interference should not happen. I'm currently using channel 136, and there's no other AP or STA on this channel according to my Wifi router. Anyway, the packet size is always the same, and the radio rate would have to drop to below 75 kBit/s to explain a lag of 20 ms. I didn't manage yet to verify this using Wireshark, though (in radiotap mode). I checked the arduino-esp32 code to look for clues but the beginPacket and endPacket routines do nothing special - in particular, the flush() function seems to just clear the input buffer - which is not the problem here. Is there really no way to tell the Wifi stack to send now ? |
The last access point I was testing defaulted to rekeying GTK every 5 minutes. It looks like ESP-IDF use wpa_supplicant for SoftAP, and wpa_supplicant defaults to wpa_group_rekey = 10 minutes, but I'm not familiar with how ESP-IDF handles configuring wpa_supplicant.
WiFi-6 channels may be up to 160MHz wide, but the 5GHz channel numbers are only 20MHz wide. If you're using >20MHz RF bandwidth, you might evaluate the other channels your link overlaps for usage too. I'll also note that channel 136 is in the shared weather radar band in many countries. Have you tried running the two devices inside a faraday cage to see if it helps? I've used a steel feed can for this before.
At the radio layer, WiFi packets will always be at fixed data rates, with the lowest at 1mbps(higher for WiF6). It's likely that something is delaying a packet, rather than the radio rate dropping below this. But hard to analyze without radiotap headers.
Generally, WiFi radios will Listen Before Talking, the specific mechanism for 802.11 is called CSMA/CA. If a radio transmits without this, it's too easy to break two packets instead of waiting their turn. But CSMA/CA should only impact latency when one device would like to transmit while another is transmitting. Could be an interferer outside of your network, or could be the access point periodically transmitting things like beacon frames. Another thing you could try is changing the AP beacon rate to see if the problem changes. AP beacon frames are often sent at a low bitrate , sometimes you can change this, but they can take a couple milliseconds. It's a neat, challenging application, but at the latency requirement you're looking for, many things could impact it. |
Here's an excerpt of a packet capture from the other day. It's in radiotap mode as you can tell, and captures everything on channel 100. The beacon rate is at default (the AP is the other ESP32-C5 - there is no one else). When beacon frames occur, the packet flow continues, unimpressed. The Delta_t column show the time diff to the previous packet. As you can see, everything goes normal, packets get sent at ~1360 µs intervals, then all of a sudden there is a delay (marked blue), then a burst of piled packets is sent, then back to normal. These things occur at apparently random intervals, sometimes twice in 10 seconds, sometimes not at all for many minutes. The delay can be anything between 2 and like 25 ms. The data packets are all sent at the same phy rate of 81 or 86 mbs - not just in this short example, but in general. To me, a WPA rekey or beacon frame or any other external timed event seems not the cause. And yes, no one else is transmitting on this channel.
Yep, that's what I am trying to find out. ATM I see no root cause outside of the sender's Wifi stack. At random intervals, packets copied to the Wifi buffers pile up for an obscure reason, and then all of a sudden all queued packets are sent in a burst. |
Half of the burst is even sent twice as retries: 53736-53740. Suspicious that your latency occurs near the same time. |
@h-milz Could you please upload the packet capture file? |
@QingzhaoYin Certainly! Alas, it seems too large to append, so I put it up here. Hope this helps. As for the timing on the sender - there is only one user task. I set up 4 DMA buffers of the appropriate size, and the I2S In a similar way, there is only one user task on the receiver. The UDP receive task waits for datagrams using Thank you! Please let me know if you need anything else! |
@h-milz Hi~,based on the packet capture analysis, the packet with SN=3648 was not transmitted promptly at the hardware layer due to some reason (most likely interference), and was sent out 13ms later. When the hardware is unable to send a packet, the packets from the upper-layer app are queued in the Wi-Fi software layer's wait queue. Once the hardware is available (after the previous packet has been successfully sent and the send success interrupt occurs), the packets in the wait queue are transmitted in frame aggregation form (this is reflected in the capture file, where data packets are sent after the RTS/CTS exchange). If possible, you could try again in a shielded room, but this is not a long-term solution since interference will always be present, especially on some commonly used channels. I think you could try implementing some handling on the receiver side. |
@QingzhaoYin would it make sense to deactivate A-MPDU to prevent frame aggregation? This application is not about maximising throughput at all, of which is plenty, but about minimising latency. But then, as you said, if I cannot influence what happens on the sender side, I can try to come up with a better Rx strategy to handle stalls in a way that trades off error correction against latency ... |
@h-milz In my opinion, you can try to deactivate AMPDU in menuconfig. |
@QingzhaoYin deactivating A-MPDU didn't help, on the contrary. I also tried 11an and 11ac to see if the tx buffering would be any different, but no. But I noticed another phenomenon. I've been running the boards today for about 7 or 8 hours now, and it appears that during the course of the day, at first there are few and relatively short stalls that my buffering can compensate, but the longer the boards run, the more and the longer stalls occur, and hence pops and glitches. I saw the same issue yesterday. I measured the sender's core temperature yesterday (using the onboard temp sensor) but it seems to rise only from 32.5 to 37.5°C which does not seem enough to trigger HW problems but have you heard of something in this direction? Guess I need to get some cooling spray. If that turns out not to be a coincidence I guess the hardware guys want to have a look before going into mass production because there might be a hot spot somewhere. I'll let you know later today if this makes any difference. Also, as can be seen e.g. around no 53700 in the pcap file I sent you, normally the receiver sends block acks about 50 µs after received packets but sometimes it does not, and then the sender sends a block ack request, and only then the receiver answers block ack, which leads to a resent packet. In most cases this seems to happen within a 1.36 ms time window. Strange though. Does that ring a bell? But I still think that the problem is on the sender side because no. 53720 is late without any apparent reason. Thank you! |
@QingzhaoYin I did some measurements with the cooling spray. The temperature measurement using the onboard sensor is a bit laggy, mainly due to the metal lid on the WROOM module I suppose, so the logs I took are a bit shaky. But it can be clearly seen that as soon as I cool down the MCU to, say 22,5°C, the error rate caused by delayed packets goes way down, and it is up again when the MCU gets warmer above 30-32. Which probably means I've got to cool the device under, say, 30°C, which can be a challenge because this is planned as a stage device, where it can get really warm. (see https://github.com/h-milz/wireless-gk to see what I mean). |
@QingzhaoYin ok, it seems I found a Rx setting which looks reasonably robust against these sender stalls.
I could try to tweak some more things like the beacon rate but this seems not to be of much influence. Been listening music now for hours but this is so far the best. And the measured latency is cosmic. In any case, Espressif may want to check if the temperature rise in fact causes these stalls. It's easy to test: simply send 700-800 1450 byte payloads per second over an hour or so and measure the packet rate on Wifi. MCS7_SGI apparently works fine with 11an, 11ac and 11ax on 5 GHz but there seems not to be much difference for the scenario above. Thanks for now! |
Answers checklist.
General issue report
Hi all,
my application is a low-latency UDP audio bridge using two ESP32-C5 sample Devkits, but I suppose the question can be answered with ESP32-C6 knowledge. I'm using the 5 GHz band with 11ax with the receiver taking the role as AP and the sender is the only station, with nothing in between and no other device in the same SSID. Wifi encryption is WPA3-SAE. The latency requirement is in the range of 10-12 ms. Basically, I got this to work but there are still occasional pops caused by buffer overruns. The reason turned out to be on the sending side. Datagrams are created and sent (using
sendto()
) every 1.36 ms triggered by the i2son_sent
callback (I do no usei2s_channel_read()
ori2s_channel_write()
- they are just too laggy). The UDP payload size is 1452 bytes so that no IP fragmenting takes place (and it is deactivated anyway). (For explanation, each datagram contains 60 samples with 8 slots with 3 byte samples each, summing up to 1440 bytes, plus a 4 byte XOR checksum, a 4 byte packet sequence number, plus a 4 byte data field for other stuff. The sample rate is 44.1 kHz. Each second, 735 datagrams are sent.)Most of the time, as I can see on the receiver and using wireshark, UDP packets are sent in regular intervals with an acceptable jitter (say, 1360 +/- 500 µs). However, at random intervals, datagrams appear to pile up for 10..20 ms or more before actually being sent, and then the piled up datagrams are sent in a fast burst, like after 12000 µs, 150, 220, 180, 150, ... about 10 or so packets.
On the receiver, I can see that in fact not a single packet is lost, and all packets arrive in sequence. My sequence-number based buffering can cope with arrival delays up to 6 datagrams. But at these occasions, packets are delayed by more than 10 or 15 ms, and this at first dries out the buffer and then overruns it. When the incident is over, everything is back to normal within two more packets.
So the question is, how do I prevent UDP packets from piling up on the sender, or how can I force packets to be sent immediately.
on_sent
andon_recv
callbacks but they are short and don't take longer than a few µs.esp_wifi_internal_tx()
but that's related to sending raw ieee80211 data.sendto()
torecvfrom()
is about 540 +/- 200 µs including the (rare) outliers.For now, I ran out of other options. Any help is appreciated!
The text was updated successfully, but these errors were encountered: