Rekeying failure on a busy link #42
Comments
I think this is related to #39 |
I have not seen the messages: Reading back, I suppose I should have attached a log from one of the nodes as well. It's attached now. I just noticed your commit; I will test again with the latest master tomorrow morning. |
This appears to be a long-standing issue with ath9k. I submitted this as a bug to the ath9k mailing list last year ( https://www.mail-archive.com/ath9k-devel%40lists.ath9k.org/msg13595.html ) and there was an earlier report of very similar issues ( http://lists.shmoo.com/pipermail/hostap/2014-November/031377.html ). I was not able to get any traction. I've gone as far as reading the key back from the card registers, and it matches what's expected. Our workaround was to have a unicast probe between the nodes that occurs right after rekey and, if it fails, rekey again. |
On 23/06/16 18:10, Alexis Green wrote:
Would you mind sharing the code of your workaround? |
The code is pretty awful looking but I'll see if I can button it up this On Fri, Jun 24, 2016 at 6:10 AM, Ferry Huberts [email protected]
|
@alexgrin: I was looking at the workaround you described, I assume that you use the NL80211_CMD_PROBE_CLIENT netlink call for this? This would probably require a small patch in net/wireless/nl80211.c to make this work (this call is only allowed for AP and P2P interfaces, by default). Or do you probe differently? |
Nope, it's nowhere near as awesome as you think. Authsae does a multicast ping (layer 3) after rekey and waits to hear a unicast response from the device with the MAC address of the peer we just rekeyed with. If there's no response, a rekey is triggered. You have to specify the interface for multicast to the daemon for this to work. It's a pretty nasty hackjob and I'll post the code as is (-ish) soon. |
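Editor's sketch of the probe-after-rekey workaround described in the comment above. This is a simplified Python model, not the authsae code: `send_multicast_probe`, `wait_for_unicast_reply`, and `rekey` are hypothetical callables standing in for the layer-3 multicast ping, the listener for a unicast reply from the peer's MAC address, and the rekey trigger. The timeout and the retry limit are assumptions as well.

```python
from typing import Callable


def verify_rekey(
    peer_mac: str,
    send_multicast_probe: Callable[[], None],
    wait_for_unicast_reply: Callable[[str, float], bool],
    rekey: Callable[[str], None],
    timeout_s: float = 1.0,   # assumed value, not taken from authsae
    max_attempts: int = 3,    # assumed value, not taken from authsae
) -> bool:
    """Probe the peer after a rekey and rekey again after every failed probe.

    Returns True as soon as a unicast reply from peer_mac is seen,
    False if all max_attempts probes go unanswered.
    """
    for _ in range(max_attempts):
        send_multicast_probe()
        if wait_for_unicast_reply(peer_mac, timeout_s):
            return True
        # No reply: treat the freshly installed key as broken and rekey.
        rekey(peer_mac)
    return False
```

The function only encodes the ordering (probe, wait, rekey on silence); how the probe is sent and how the reply is matched to the peer is left to the caller.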
Here's the yuckyness - uniumwifi@a1591d3 |
thanks for sharing! |
Thank you for the patch! I'll have a look at it. Sorry for the radio silence, was testing a patch which may help find a solution. Would someone mind checking it out and maybe suggesting a better approach? |
I've posted the issue on the ath9k-devel list as well, hopefully I can stir something up/get the help to address this. https://lists.ath9k.org/pipermail/ath9k-devel/2016-July/014676.html |
So far no response; I tried to bump it once with no effect. For the foreseeable future, I've chosen to use software encryption. |
Please note that your maximum achievable throughput will degrade if using On Wed, Jul 13, 2016 at 11:41 PM, MichelStam [email protected]
|
Hello Chun-Yeow, I agree, there is a measurable performance drop of about 2 Mbps. Luckily, for this particular application, high bandwidth is not the most important thing, but link stability is. On a side note, I seem to have gotten a little traction on the ath9k-devel list; Adrian Chadd has taken a look at the Atheros reference driver, which seems to have a fix called ATH_SUPPORT_KEYPLUMB_WAR. This reinserts the key when there are Rx decryption errors. Would those of you who have run into this issue at some point be willing to pitch in on the ath9k-devel list? Cheers, Michel |
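Editor's sketch of the idea behind a key re-plumb workaround as described above: reinsert the key when Rx decryption errors show up. This is a Python model under stated assumptions, not the reference driver's ATH_SUPPORT_KEYPLUMB_WAR logic, which the thread does not show. The class name, the `install_key` callback, and the threshold of consecutive errors are all hypothetical.

```python
from typing import Callable, Dict


class KeyReplumbWorkaround:
    """Re-install a key after several consecutive Rx decrypt errors."""

    def __init__(self, install_key: Callable[[int, bytes], None], threshold: int = 3):
        self.install_key = install_key       # hypothetical "write key to hardware" hook
        self.threshold = threshold           # assumed value
        self.keys: Dict[int, bytes] = {}     # slot -> key material we last installed
        self.errors: Dict[int, int] = {}     # slot -> consecutive decrypt errors

    def set_key(self, slot: int, key: bytes) -> None:
        """Normal key installation; remembers the key for later re-plumbing."""
        self.keys[slot] = key
        self.errors[slot] = 0
        self.install_key(slot, key)

    def on_rx(self, slot: int, decrypt_ok: bool) -> bool:
        """Record one received frame; return True if the key was re-plumbed."""
        if decrypt_ok:
            self.errors[slot] = 0
            return False
        self.errors[slot] = self.errors.get(slot, 0) + 1
        if self.errors[slot] >= self.threshold and slot in self.keys:
            self.install_key(slot, self.keys[slot])
            self.errors[slot] = 0
            return True
        return False
```

Counting consecutive errors rather than total errors keeps a single corrupted frame from triggering a reinstall; whether the real workaround does this is not stated in the thread.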
Can you point me to that thread? It's very relevant for our deployment as well |
Of course, |
tnx |
I recently got a mail from Sven Eckelman about a patch which may solve the situation: I have not yet had time to take a look at this, so caveat emptor. |
thanks. |
Is that patch being upstreamed? |
No, it is not. Sven added the following to the message:
Here is the original email: |
I just contacted Antonio via email, offering help in upstreaming. |
Sure, I was planning to do this somewhere in the coming days. Maybe I can re-use some of the kludge I wrote up to get around the use of internals (unless I am doing that myself as well). |
shall we continue via direct email? mailings (at) hupie (dot) com |
After spending 2 weeks on this issue together with Ferry Huberts, we did not get much further. We tried:
It seems to me that the chip gets very confused when a key is installed while it is processing a lot of traffic. Quieting the chip does not seem to help, unless I did not get it quiet enough. I have attached the various patches as an example of what was tried.
I lost the patch which quiets the chip prior to keying; accidental delete .... Then, after the switch statement, add: Adrian Chadd has suggested in an email to try my original buckshot patch ath9k-install_key-buckshot.diff.txt, but this time reinstall keys after the reset. I will try to find some time and do this, see if it helps. Cheers, Michel |
Yes, I have never seen hardware behave so baffling and I suspect we will not get any further unless we get more information on what is really going on. very very unfortunate. |
Ok. So after a few valiant attempts, I got a little further, but still nowhere close to a working patch. From a bug which is triggered on every rekey, I'm now at a situation where the error usually occurs every couple of minutes. The longest run I got was 500 seconds with rekeying every 60 seconds. Another thing which significantly delayed my progress is the sheer amount of locking in the ath9k driver. I needed to grab the rtnl_lock in order to access the key material in mac80211, but sc->mutex also seems to be required. Grabbing both is inviting all sorts of locking issues, which usually result in the whole network stack hanging. As a quick hack, I stopped using sc->mutex and grabbed only rtnl_lock (not doing so will cause a BUG_ON on every rekey). Please take a close look at this patch. It is by no means complete or clean yet, so not ready for production. |
I do have a fix in authsae (rekeying) but need to test it further |
My rekey code works well. |
One question, why do you use authsae rather than wpa_supplicant, I'd think wpa_supplicant is much more widely used? |
On Sat, Apr 22, 2017 at 12:14:59AM -0700, Xuebing Wang wrote:
One question, why do you use authsae rather than wpa_supplicant, I'd think wpa_supplicant is much more widely used?
At the time authsae was created, wpa_supplicant didn't support SAE.
Now it does -- actually cozybit contributed that support for wpa_supplicant.
|
Personally, I had some issues with wpa_supplicant in combination with OpenWRT. Some race condition which prevented either the AP or mesh function from working. Did not have this problem when starting everything manually, just when using the OpenWRT configuration system. Since AuthSAE did work, and seemed more stable at the time I settled for that. |
Thanks for your answer. I was having a race condition with dnsmasq (for Ethernet). Not sure if the link below is of any help. |
Any final solutions on this? I have exactly same issue in ath9k. Thanks! |
Nope, sorry.
Michel Stam
… On 3 May 2018, at 23:06, zhejunli ***@***.***> wrote:
Any final solutions on this? I have exactly same issue in ath9k. Thanks!
|
... I think I'm finally hitting it in ath9k at my current employer. Let's see how far down the rabbit hole I go. |
On Mon, Jul 02, 2018 at 05:31:06PM -0700, Adrian Chadd wrote:
... I think I'm finally hitting it in ath9k at my current employer. Let's see how far down the rabbit hole I go.
FWIW I think the way authsae does rekeying could be reworked to avoid
this -- it rekeys the PMK but we could/should rekey the MTK instead.
Then there'd be a different key id so both keys could be present in
hardware for a short time.
But for MGTK I don't think there's an equivalent solution.
|
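Editor's sketch of the alternating key id idea from the comment above: install the new key under a different key id so the old and new keys coexist for a short time, then retire the old one. This is a Python model, not authsae or mac80211 code. The class, its method names, and the choice of ids 0 and 1 are assumptions made for illustration.

```python
from typing import Dict, Optional


class DualKeySlots:
    """Keep old and new pairwise keys installed side by side during a rekey."""

    def __init__(self) -> None:
        self.slots: Dict[int, bytes] = {}        # key id -> key material
        self.tx_key_id: Optional[int] = None     # key id used for transmit

    def rekey(self, new_key: bytes) -> Optional[int]:
        """Install new_key under the other key id and switch TX to it.

        Returns the previous TX key id, which stays installed so frames
        still in flight under the old key can be decrypted.
        """
        new_id = 0 if self.tx_key_id != 0 else 1
        old_id = self.tx_key_id
        self.slots[new_id] = new_key
        self.tx_key_id = new_id
        return old_id

    def retire(self, key_id: Optional[int]) -> None:
        """Remove an old key once the grace period is over."""
        if key_id is not None and key_id != self.tx_key_id:
            self.slots.pop(key_id, None)

    def rx_key(self, key_id: int) -> Optional[bytes]:
        """Look up the decryption key by the key id carried in the frame."""
        return self.slots.get(key_id)
```

A caller would hold on to the id returned by `rekey` and pass it to `retire` after a delay of its choosing; the length of that delay is not discussed in the thread.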
Hi,
I'm going to experiment with installing a second key and then blanking out
the first one, or maybe blanking out the first one before adding the
second. The challenge is figuring out whether the keycache will let you get
away with such hijinx for the peer key.
I'll also see if stopping RX whilst programming unicast keycache slot
updates helps. At least in net80211 you get a "i'm going to do a keycache
update in a sec" so you can batch keycache updates behind say, stopping the
TX/RX path. I dunno whether that's easy on mac80211 but I'll see.
…-adrian
|
interesting. if I replumb the keys on the receiver side then it doesn't fix things. That's ... odd. |
(so I wonder if there are two bugs here..) |
Per my understanding:
|
Oh I know it's an ath9k chip bug. :-) That's what I have on my desk atm |
I mean, the link shows a key installation deficiency in the 4-way handshake mechanism. Following that idea to solve that key re-installation problem may help to solve this "ath9k chip bug" too. |
my rekeying patch (which was merged in #59 ) does that, and works well, except when the chip is under heavy load. |
Yeah, the "heavy load" issue is our issue here. I'll keep digging and see what I can find. Is your patch focused on rekeying the transmit side key, or the remote/receiver key? (yes the CCMP keys are used for both TX/RX, I'm more interested in which side is doing the replumbing of keys to the HW.) |
ok, yeah. I'm seeing three separate bugs:
I'm doing a UDP iperf of a few tens of mbit from an ath9k AP -> ath9k STA (both Peacock / AR9580) to reproduce this. In all cases the right keys make it through to the keycache code. When it breaks then at least whenever I've caught it the STA can still send frames to the AP which the AP can decrypt, but the frames from the AP can't be decrypted by the STA. I wonder if the sender side hardware bug will suck less if we just completely pause transmit before doing a rekey (because that's what the rekeying patch seems to detect). The RX side rekey thing that QCA does is a different beast and I think fixes another bit of the problem - I'm not actively TXing (besides ACKs, obviously) from the STA -> AP during the failure mode, so if there's a hardware bug it's likely due to having a packet in receive flight during rekeying. (Maybe I can experiment with pausing the MAC TX/RX whilst replumbing the key, which would have the added benefit of not ACKing anything during that window..) |
Ok, so bug 1 here was fixed by just disabling PTK rekey. It turns out the data/control path for transmit and mac80211 key management is not really set up for doing seamless PTK rekey, at least on the transmit side, and you can't guarantee to not drop frames on the receive side either. So, I'm not going to do it. Which means the second and third don't happen. Now, if someone has some spare time (and maybe me too) I think we should experiment with draining the ath9k station queue so we aren't transmitting anything that can use that keycache entry before we plumb in the new key. It's tricky because mac80211/ath9k aren't set up for that. But I /think/ that'll work around the TX keycache bug. The RX keycache bug shouldn't be triggered if you're not actively receiving packets to decrypt whilst you're changing the key - which you can guarantee if your AP is not doing stupid crap (ie, has this bug fixed) but you can't otherwise; that particular one will benefit from the keycache plumb hack from QCA. However, and here's the rub - it doesn't seem to work reliably if you're constantly hitting it with a stream of packets. It really needs to sneak in when no active RX is being done for that keycache entry. |
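Editor's sketch of the "drain the station queue before plumbing the new key" experiment proposed above. This is a single-threaded Python model of the ordering only, not ath9k or mac80211 code. `TxPath`, `begin_rekey`, and `finish_rekey` are hypothetical names, and a real driver would also have to wait for the hardware to finish the frames it has already been handed.

```python
from collections import deque
from typing import Deque, List, Tuple


class TxPath:
    """Model of a per-station TX queue that is drained around a key change."""

    def __init__(self, key: bytes) -> None:
        self.key = key
        self.paused = False
        self.pending: Deque[str] = deque()       # frames queued for transmit
        self.held: Deque[str] = deque()          # frames held back while paused
        self.sent: List[Tuple[str, bytes]] = []  # (frame, key it was sent with)

    def enqueue(self, frame: str) -> None:
        """Queue a frame; while a rekey is in progress it is held back."""
        (self.held if self.paused else self.pending).append(frame)

    def flush(self) -> None:
        """Transmit everything pending with the currently installed key."""
        while self.pending:
            self.sent.append((self.pending.popleft(), self.key))

    def begin_rekey(self) -> None:
        """Stop accepting frames and drain what still uses the old key."""
        self.paused = True
        self.flush()

    def finish_rekey(self, new_key: bytes) -> None:
        """Install the new key on an idle queue, then release held frames."""
        self.key = new_key
        self.paused = False
        self.pending.extend(self.held)
        self.held.clear()
```

The point of the split into `begin_rekey` and `finish_rekey` is that the key write happens only while nothing is queued against that keycache entry, which is the condition the comment above hopes will avoid the TX keycache bug.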
I've found the same issue with rekeying a PTK under load and am currently trying to upstream a fix for that. In fact there are several ways in which normal kernels can mess up the PN, but I think we have found them all now. You will get warnings that userspace (wpa_supplicant) is requesting rekeys when it should not; fixing that will need patches to either ath9k or wpa_supplicant, which are simply not available yet. When testing the patch I suggest you also make sure you have this fix applied: Edit: updated link |
Guys,
I've been testing an annoying bug I've been having on a mesh of (currently) 2 units for the past week or so, but I cannot seem to find what causes it.
It happens when I run an iperf test between the units. At or around the time the SAE lifetime expires, a rekey occurs, after which traffic between the units stops.
Sometimes a packet arrives about a key lifetime later, but it does not get stable anymore.
If I leave the link idle (no iperf test, just some pings), then this problem does not seem to occur.
Looking at the debug traces from meshd-nl80211, I can find no fault. I also looked at the key material sent down to the ath9k driver (printk's in the kernel driver), but even reading back those registers does not indicate to me that there's a fault.
Both units use an ath9k Atheros card; One is an AzureWave AR5B95, the other is a Compex WLE200N2-23. I have also observed the problem on Compex WLE350NX cards, so I am guessing this is not hardware related.
I set up both units with the attached config below;
meshd.txt
The kernel I use is 4.4.11, but I've seen the same problem with 3.10.49.
The compat-wireless 2016-01-10 driver set used by OpenWRT seems to have the same problem with the old 3.10.34 kernel I run on that system.
The iperf setup is (using 2.0.5):
I create the mesh interfaces by:
Right now the key lifetime is at 60 seconds for problem reproduction, but I have seen the same problem on a link with a key lifetime of 3600 seconds; the link then dies at that time.
Can anyone give me a couple of pointers where to look, or maybe help me out?
Regards,
Michel Stam