Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Making UDP send latency more predictable (i.e. reducing jitter) (IDFGH-14588) #15345

Open
3 tasks done
h-milz opened this issue Feb 6, 2025 · 13 comments
Open
3 tasks done
Assignees
Labels
Status: Opened Issue is new

Comments

@h-milz
Copy link

h-milz commented Feb 6, 2025

Answers checklist.

  • I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
  • I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
  • I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

Hi all,

my application is a low-latency UDP audio bridge using two ESP32-C5 sample Devkits, but I suppose the question can be answered with ESP32-C6 knowledge. I'm using the 5 GHz band with 11ax with the receiver taking the role as AP and the sender is the only station, with nothing in between and no other device in the same SSID. Wifi encryption is WPA3-SAE. The latency requirement is in the range of 10-12 ms. Basically, I got this to work but there are still occasional pops caused by buffer overruns. The reason turned out to be on the sending side. Datagrams are created and sent (using sendto()) every 1.36 ms triggered by the i2s on_sent callback (I do no use i2s_channel_read() or i2s_channel_write() - they are just too laggy). The UDP payload size is 1452 bytes so that no IP fragmenting takes place (and it is deactivated anyway). (For explanation, each datagram contains 60 samples with 8 slots with 3 byte samples each, summing up to 1440 bytes, plus a 4 byte XOR checksum, a 4 byte packet sequence number, plus a 4 byte data field for other stuff. The sample rate is 44.1 kHz. Each second, 735 datagrams are sent.)

Most of the time, as I can see on the receiver and using wireshark, UDP packets are sent in regular intervals with an acceptable jitter (say, 1360 +/- 500 µs). However, at random intervals, datagrams appear to pile up for 10..20 ms or more before actually being sent, and then the piled up datagrams are sent in a fast burst, like after 12000 µs, 150, 220, 180, 150, ... about 10 or so packets.

On the receiver, I can see that in fact not a single packet is lost, and all packets arrive in sequence. My sequence-number based buffering can cope with arrival delays up to 6 datagrams. But at these occasions, packets are delayed by more than 10 or 15 ms, and this at first dries out the buffer and then overruns it. When the incident is over, everything is back to normal within two more packets.

So the question is, how do I prevent UDP packets from piling up on the sender, or how can I force packets to be sent immediately.

  • I tried to set the FreeRTOS tick rate to 500 Hz, hoping that this would trigger the Wifi send more often, to no avail.
  • Reducing the number of dynamic Tx buffers to 5 and thus forcing a send more often did not help.
  • QoS makes no sense because there are no other packets to begin with.
  • Wifi power saving is disabled.
  • On the sender and the receiver, there is only one user task each, running at priority 15 or so, but competing for CPU time with the WiFi task I suppose. The only other things with higher priorities are the on_sent and on_recv callbacks but they are short and don't take longer than a few µs.
  • I also tried to make the payload smaller by reducing the number of I2S frames to, say, 40 or 50, but this makes the system even less robust against UDP jitter because the interrupt rate is then higher, and each arriving packet has less time for processing. 60 seems to be the sweet spot.
  • Making the receive buffer large enough to cover real outliers would break the latency requirement, and I would end up somewhere between 25 and 30 ms.
  • ESP-NOW I didn't try yet because it is prone to denial of service, lacking authentication - and I don't know if this would solve the problem anyway because ESP-NOW sits on top of the data link layer, like the IP layer, and thus presumably faces the same Wifi Tx buffering issue.
  • I also found a reference to using esp_wifi_internal_tx() but that's related to sending raw ieee80211 data.
  • Also, it's not the maximum throughput. Using the iperf example, I measured a max sustained UDP data rate for a 1440 byte payload of 62 MBit/s between the two boards using MCS7_SGI, and my application uses only about 8.5. The raw UDP packet latency from sendto() to recvfrom() is about 540 +/- 200 µs including the (rare) outliers.

For now, I ran out of other options. Any help is appreciated!

@espressif-bot espressif-bot added the Status: Opened Issue is new label Feb 6, 2025
@github-actions github-actions bot changed the title Making UDP send latency more predictable (i.e. reducing jitter) Making UDP send latency more predictable (i.e. reducing jitter) (IDFGH-14588) Feb 6, 2025
@bryghtlabs-richard
Copy link
Contributor

FYI: if WPA3-SAE has GTK rekeying like WPA2, we routinely see latency of ~1s during this process. If you're in control of the network, you may want to disable this.

Another thing that can go wrong: if a frame is lost, layer2 will often repeat it at a lower speed. If this is happening due to intermittent interference, sometimes layer2 will drop below minimum acceptable bitrate for your application. In this case, it may be better to disable lower radio rates.

How are you recording with WireShark? Is it sitting at the receiver at Layer3? Or is it with another machine capturing at layer2 with radiotap headers? If you have the radiotap headers, you should be able to tell what MCS the radio is doing per packet, as well as acknowledgements.

@h-milz
Copy link
Author

h-milz commented Feb 7, 2025

@bryghtlabs-richard Thank you for the insight but I am afraid I was not able to verify that any of these brings a relief. For one, rekeying takes place every hour or once a day depending on the implementation (no idea how frequently ESP-IDF does that - I assume the code is somewhere in the binary libnet80211.a). Here's an example that I'm currently running: The numbers are the packet arrival times after the previous ones - only those > 4000µs. The nominal value should be ~1360. As you can tell from the log timestamps, this happens rather frequently, but up to 4 missing packets my buffering can cope with it. But sometimes the lag exceeds 6 ms, and in these cases I get these buffer overruns.

W (1663743) wgk_ring_buf: 4705
W (1663751) wgk_ring_buf: 7481
W (1666615) wgk_ring_buf: 4239
W (1667953) wgk_ring_buf: 8157
W (1667959) wgk_ring_buf: 5441
W (1667979) wgk_ring_buf: 19583
W (1671429) wgk_ring_buf: 4021

Interference should not happen. I'm currently using channel 136, and there's no other AP or STA on this channel according to my Wifi router. Anyway, the packet size is always the same, and the radio rate would have to drop to below 75 kBit/s to explain a lag of 20 ms. I didn't manage yet to verify this using Wireshark, though (in radiotap mode).

I checked the arduino-esp32 code to look for clues but the beginPacket and endPacket routines do nothing special - in particular, the flush() function seems to just clear the input buffer - which is not the problem here.

Is there really no way to tell the Wifi stack to send now ?

@rsaxvc
Copy link

rsaxvc commented Feb 8, 2025

For one, rekeying takes place every hour or once a day depending on the implementation (no idea how frequently ESP-IDF does that - I assume the code is somewhere in the binary libnet80211.a).

The last access point I was testing defaulted to rekeying GTK every 5 minutes. It looks like ESP-IDF use wpa_supplicant for SoftAP, and wpa_supplicant defaults to wpa_group_rekey = 10 minutes, but I'm not familiar with how ESP-IDF handles configuring wpa_supplicant.

I'm currently using channel 136, and there's no other AP or STA on this channel according to my Wifi router.

WiFi-6 channels may be up to 160MHz wide, but the 5GHz channel numbers are only 20MHz wide. If you're using >20MHz RF bandwidth, you might evaluate the other channels your link overlaps for usage too. I'll also note that channel 136 is in the shared weather radar band in many countries.

Have you tried running the two devices inside a faraday cage to see if it helps? I've used a steel feed can for this before.

Anyway, the packet size is always the same, and the radio rate would have to drop to below 75 kBit/s to explain a lag of 20 ms.

At the radio layer, WiFi packets will always be at fixed data rates, with the lowest at 1mbps(higher for WiF6). It's likely that something is delaying a packet, rather than the radio rate dropping below this. But hard to analyze without radiotap headers.

Is there really no way to tell the Wifi stack to send now ?

Generally, WiFi radios will Listen Before Talking, the specific mechanism for 802.11 is called CSMA/CA. If a radio transmits without this, it's too easy to break two packets instead of waiting their turn. But CSMA/CA should only impact latency when one device would like to transmit while another is transmitting. Could be an interferer outside of your network, or could be the access point periodically transmitting things like beacon frames.

Another thing you could try is changing the AP beacon rate to see if the problem changes.

AP beacon frames are often sent at a low bitrate , sometimes you can change this, but they can take a couple milliseconds.

It's a neat, challenging application, but at the latency requirement you're looking for, many things could impact it.

@h-milz
Copy link
Author

h-milz commented Feb 8, 2025

Here's an excerpt of a packet capture from the other day. It's in radiotap mode as you can tell, and captures everything on channel 100. The beacon rate is at default (the AP is the other ESP32-C5 - there is no one else). When beacon frames occur, the packet flow continues, unimpressed. The Delta_t column show the time diff to the previous packet. As you can see, everything goes normal, packets get sent at ~1360 µs intervals, then all of a sudden there is a delay (marked blue), then a burst of piled packets is sent, then back to normal. These things occur at apparently random intervals, sometimes twice in 10 seconds, sometimes not at all for many minutes. The delay can be anything between 2 and like 25 ms. The data packets are all sent at the same phy rate of 81 or 86 mbs - not just in this short example, but in general. To me, a WPA rekey or beacon frame or any other external timed event seems not the cause. And yes, no one else is transmitting on this channel.

It's likely that something is delaying a packet,

Yep, that's what I am trying to find out. ATM I see no root cause outside of the sender's Wifi stack. At random intervals, packets copied to the Wifi buffers pile up for an obscure reason, and then all of a sudden all queued packets are sent in a burst.

Image

@rsaxvc
Copy link

rsaxvc commented Feb 8, 2025

At random intervals, packets copied to the Wifi buffers pile up for an obscure reason, and then all of a sudden all queued packets are sent in a burst.

Half of the burst is even sent twice as retries: 53736-53740. Suspicious that your latency occurs near the same time.

@QingzhaoYin
Copy link
Collaborator

@h-milz Could you please upload the packet capture file?

@h-milz
Copy link
Author

h-milz commented Feb 10, 2025

@QingzhaoYin Certainly! Alas, it seems too large to append, so I put it up here. Hope this helps.

As for the timing on the sender - there is only one user task. I set up 4 DMA buffers of the appropriate size, and the I2S on_recv callback notifies the UDP send task (which waits using xTaskNotifyWait()) to pick up the DMA buffer address, which then does some processing (convert the 32 bit format to 24 to save 25% network bandwidth, append a sequence number and a simple XOR checksum) that takes 15-20µs max. Then, the processed buffer is sent to the LWIP stack using sendto() non-blocking, then waits for the next notify. The whole event takes about 25µs and repeats every 1.36 ms. For the entire rest of the time, the Wifi task has the CPU.

In a similar way, there is only one user task on the receiver. The UDP receive task waits for datagrams using recv_from(), and when a packet arrives, it is checked for the correct length and then passed to a ringbuffer insert function which also undoes the sender processing. The processed packet is the put into the Rx buffer according to its sequence number generated in the sender. Every 1.36 ms, the I2S on_sent callback picks up the next packet from the buffer and memcpy()s it to the DMA buffer, which takes 5µs or so. Again, the whole processing from packet reception to buffer insertion takes not more than about 20-25 µs. The remaining CPU time is for the Wifi task.

Thank you! Please let me know if you need anything else!

@QingzhaoYin
Copy link
Collaborator

@h-milz Hi~,based on the packet capture analysis, the packet with SN=3648 was not transmitted promptly at the hardware layer due to some reason (most likely interference), and was sent out 13ms later. When the hardware is unable to send a packet, the packets from the upper-layer app are queued in the Wi-Fi software layer's wait queue. Once the hardware is available (after the previous packet has been successfully sent and the send success interrupt occurs), the packets in the wait queue are transmitted in frame aggregation form (this is reflected in the capture file, where data packets are sent after the RTS/CTS exchange). If possible, you could try again in a shielded room, but this is not a long-term solution since interference will always be present, especially on some commonly used channels. I think you could try implementing some handling on the receiver side.

@h-milz
Copy link
Author

h-milz commented Feb 11, 2025

@QingzhaoYin would it make sense to deactivate A-MPDU to prevent frame aggregation? This application is not about maximising throughput at all, of which is plenty, but about minimising latency.

But then, as you said, if I cannot influence what happens on the sender side, I can try to come up with a better Rx strategy to handle stalls in a way that trades off error correction against latency ...

@QingzhaoYin
Copy link
Collaborator

@h-milz In my opinion, you can try to deactivate AMPDU in menuconfig.

@h-milz
Copy link
Author

h-milz commented Feb 11, 2025

@QingzhaoYin deactivating A-MPDU didn't help, on the contrary. I also tried 11an and 11ac to see if the tx buffering would be any different, but no.

But I noticed another phenomenon. I've been running the boards today for about 7 or 8 hours now, and it appears that during the course of the day, at first there are few and relatively short stalls that my buffering can compensate, but the longer the boards run, the more and the longer stalls occur, and hence pops and glitches. I saw the same issue yesterday. I measured the sender's core temperature yesterday (using the onboard temp sensor) but it seems to rise only from 32.5 to 37.5°C which does not seem enough to trigger HW problems but have you heard of something in this direction? Guess I need to get some cooling spray. If that turns out not to be a coincidence I guess the hardware guys want to have a look before going into mass production because there might be a hot spot somewhere. I'll let you know later today if this makes any difference.

Also, as can be seen e.g. around no 53700 in the pcap file I sent you, normally the receiver sends block acks about 50 µs after received packets but sometimes it does not, and then the sender sends a block ack request, and only then the receiver answers block ack, which leads to a resent packet. In most cases this seems to happen within a 1.36 ms time window. Strange though. Does that ring a bell? But I still think that the problem is on the sender side because no. 53720 is late without any apparent reason.

Thank you!

@h-milz
Copy link
Author

h-milz commented Feb 11, 2025

@QingzhaoYin I did some measurements with the cooling spray. The temperature measurement using the onboard sensor is a bit laggy, mainly due to the metal lid on the WROOM module I suppose, so the logs I took are a bit shaky. But it can be clearly seen that as soon as I cool down the MCU to, say 22,5°C, the error rate caused by delayed packets goes way down, and it is up again when the MCU gets warmer above 30-32.

Which probably means I've got to cool the device under, say, 30°C, which can be a challenge because this is planned as a stage device, where it can get really warm. (see https://github.com/h-milz/wireless-gk to see what I mean).

@h-milz
Copy link
Author

h-milz commented Feb 12, 2025

@QingzhaoYin ok, it seems I found a Rx setting which looks reasonably robust against these sender stalls.

  • re-activating PM reduces the sender's core temp by a degree or two. Not much change but it seems it can't hurt.
  • increased the number of static Tx buffers to 40. Not seen any sender-side ENOMEM errors since.
  • the read pointer offset is set to 3, which means a latency of 9.8 ms end-to-end
  • the Rx buffer now contains 64 60-sample slots in SPIRAM, which can cover 87ms at a sample rate of 44100 Hz without being overrun by a burst following a stall. I've seen up to 30 ms stalls today and it was barely noticable.
  • Instead of looping the last valid sample in case of a stall, I return nulls (i.e. silence), effectively muting the DAC output. This is barely audible up to 10-15 ms, and it will definitely confuse the guitar synth way less than glitches caused by looping.
  • Should the thing ever go into some sort of production, I will make sure I use a dev board with a naked MCU. In this case a small CPU cooler and a fan will be much more efficient than with the metal covered WROOM module.

I could try to tweak some more things like the beacon rate but this seems not to be of much influence.

Been listening music now for hours but this is so far the best. And the measured latency is cosmic.

In any case, Espressif may want to check if the temperature rise in fact causes these stalls. It's easy to test: simply send 700-800 1450 byte payloads per second over an hour or so and measure the packet rate on Wifi. MCS7_SGI apparently works fine with 11an, 11ac and 11ax on 5 GHz but there seems not to be much difference for the scenario above.

Thanks for now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Opened Issue is new
Projects
None yet
Development

No branches or pull requests

6 participants