Bug: Meshnet cmdDel is not idempotent, leading to state changes when CNI ADD and CNI DEL commands overlap #70
Comments
This looks strange. Technically, based on how I understand it, the liveness of a pod is determined by its srcIP and netNS fields: https://github.com/networkop/meshnet-cni/blob/master/plugin/meshnet.go#L235 So I think the key is to understand why lf1 thinks that lf5 is not alive. Can you get the full output of lf5's custom resource during the failure condition? Curious if these fields get rewritten/deleted for some reason.
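To make that rule concrete, here's a minimal sketch of the check being described, assuming a per-pod record with srcIP and netNS fields; type and field names are illustrative, not the actual plugin/meshnet.go source:

```go
package main

import "fmt"

// podStatus stands in for the per-pod fields recorded at CNI ADD time; the
// real meshnet types differ, this only illustrates the liveness rule.
type podStatus struct {
	SrcIP string // e.g. "10.244.0.5"
	NetNS string // e.g. "/proc/19196/ns/net"
}

// isAlive mirrors the described check: a peer is treated as alive only when
// both fields are populated; if either has been cleared (for example by a
// stray CNI DEL), links to that peer are skipped.
func isAlive(p podStatus) bool {
	return p.SrcIP != "" && p.NetNS != ""
}

func main() {
	lf5 := podStatus{} // status wiped after an unexpected DEL
	fmt.Println("lf5 alive:", isAlive(lf5))
}
```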
Ok, here goes. Same emulation topology, different nodes this time. lf4 is stuck, so I deleted it and recreated it so I can get the right meshnet-cni.log:
Pods lf1 and lf5 are alive, but lf3 is also stuck:
Topos for lf1 and lf5:
This doesn't look healthy at all. Both lf1 and lf5 have
Oh, actually, I just read your original message. So this only happens after you've deleted a pod and redeployed it.
Oh, I might actually know what happens here. When a pod is deleted, meshnet cleans up net_ns and src_ip but it doesn't clean up
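Whatever the leftover field turns out to be, here's a hypothetical sketch of that DEL-time cleanup: the liveness fields get cleared, while other per-pod bookkeeping in the status (shown here as a made-up Skipped list) survives. The field names are assumptions, not the real CRD schema:

```go
package main

import "fmt"

// topologyStatus is a stand-in for the per-pod status meshnet keeps; only
// SrcIP and NetNS correspond to fields mentioned in this thread, the Skipped
// list is a hypothetical example of state that could be left behind.
type topologyStatus struct {
	SrcIP   string
	NetNS   string
	Skipped []string
}

// cleanupOnDelete clears the liveness fields only. Anything else in the
// status survives the DEL, which is the kind of stale state suspected here.
func cleanupOnDelete(s *topologyStatus) {
	s.SrcIP = ""
	s.NetNS = ""
	// s.Skipped is deliberately not reset in this sketch.
}

func main() {
	s := &topologyStatus{SrcIP: "10.244.0.5", NetNS: "/proc/42/ns/net", Skipped: []string{"lf1"}}
	cleanupOnDelete(s)
	fmt.Printf("%+v\n", *s)
}
```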
I'll give it a try and send a PR if it helps. [ETA] It happens /most often/ with "restarting" pods, but it also happens on a clean deployment--usually after I've been mucking about for a while.
ok, sounds good, let me know if it works.
I tried deleting the topos with the pods, but it didn't make a difference. FWIW, what I'm doing here is an outage event. I could exec into the containers and just down the links, but deleting the pods is quicker and makes it clearer that the pod is down when I'm analyzing the collected data (I'm running the PLG stack with SNMP-exporter so I get a full range of instrumentation). If you have advice, I'm open.
I can't think of anything obvious other than that skipped cleanup.
I'm still planning to take a crack at this, but I wanted to point out that I can see this issue happen on a full, clean deployment. Some pods will not get all the expected interfaces. When this happens, it seems to be [sometimes] sticky; IOW, if I undeploy and redeploy, the same pods will get stuck [or the whole deployment succeeds]. That makes me think the state issue is in the CNI interface or in meshnetd, not the resources.
@Cerebus can you please share your CR yaml containing pods and links? Since it's happening repeatedly at your end, I should be able to recreate it at my end and provide insight (if I can find any).
Hi @Cerebus,
There is no state anywhere other than in CRs. If you do a clean re-deploy, i.e. delete all existing CRs and recreate them, it should not look and fail in the same way. Ideally, like others have mentioned, if you can share a set of steps to reproduce, this would help move things along. Another option is to wait for #73 to be merged as it fixes a few cleanup and sync issues, especially for GRPC wires.
If I had a reliable reproduction case, I'd probably have either a detailed report or a fix long before now. Mainly I've just suffered with it, but I'm nearing a point where I need to start sharing my platform, so it's bubbled back near the top of my fix list.

This condition seems to occur totally at random. I've seen it occur on single-node and multi-node clusters, immediately after a cluster reboot, on an initial topology deployment (or complete re-deployment from scratch), after re-applying a Pod that had been deleted from a running topology, in deployments of 2 Pods or hundreds of Pods, and with (complete or partial) delete/create events separated by seconds, minutes, or hours. All I can say is it /happens/. You sense my frustration. :(

What I also don't get is why it seems to only happen to me. It could be that y'all just don't notice. For easy detection, here's what I do in an initContainer (yeah, the shell script is hacky, but it's somewhat limited by `#!/bin/sh`):

```sh
#!/bin/sh
# Abort without doing anything if env vars are unset.
: ${IFACES?}

ifacecount() {
    # Count links, not counting lo and the container default eth0.
    # `ip link` prints two lines per interface, hence the divide by 2.
    NUMIFACES=$(($(ip link | wc -l) / 2 - 2))
}

ifacecount

# Count expected addresses from the downward API (count commas and add 1).
EXPECTEDIFACES=$(($(echo ${IFACES} | tr -dc ',' | awk '{ print length; }') + 1))

while [ "${NUMIFACES}" -lt "${EXPECTEDIFACES}" ]; do
    echo "waiting; expected: ${EXPECTEDIFACES} found: ${NUMIFACES}"
    sleep 1
    ifacecount
done
```

This will stick your pods in Init until meshnet has added all the expected interfaces.

I see that #73 merged, so I'll redeploy meshnet and see what happens.
OK, I have something odd here; not sure if it's reproducible but maybe I'm onto something. From the above, I delete the pod and everything seems ok:
Then--recreate that pod before the second (spurious?) delete pod process happens:
Then, the second "delete" event occurs:
But the pod is still running and has all the right ifaces:
And, to cap it all off, Topology lf5 status has been wiped:
From which I predict that if I delete a neighbor and recreate it, it'll get stuck b/c lf5 "does not exist". And indeed, 'tis so, as the new lf1 pod is missing eth4:
ETA: Which further implies that I can recover by deleting the /pair/ of problem pods and redeploying them, and indeed, that works. It's possible that when I see this condition on a full start it's b/c it came after a full-teardown, so this double-call of CNI DEL wasn't complete yet.
Is this relevant: |
It looks like you're onto something. I'm AFK for the rest of the week but it'd be nice to confirm where that second DELETE call originates from. Can you capture the logs from the API server at the same time to confirm the assumption from the k/k issue you linked?
A couple of things--
IMHO this is a meshnet bug; DEL commands are not idempotent. Pod networking is tracked by the (src_ip, net_ns) tuple. One way to resolve this is to track status by CONTAINER_ID (i.e., sandbox). Conveniently this is a CNI argument (my own debugging statement, don't look for this anywhere):
I'll work on this unless someone has a better idea. I'd like to do this without breaking changes, but just looking at the status subresource CRD, that may not be possible.
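As a rough illustration of the idea, here's a sketch assuming the standard CNI skel package (which does deliver CNI_CONTAINERID as args.ContainerID); the statusStore interface is a stand-in for wherever the per-pod state lives (e.g. the Topology status subresource), not meshnet's actual API:

```go
package main

import (
	"github.com/containernetworking/cni/pkg/skel"
)

// statusStore is a hypothetical view of the per-pod state.
type statusStore interface {
	RecordedContainerID(pod string) (string, error)
	ClearStatus(pod string) error
}

// idempotentDel only tears down state if the DEL's sandbox ID matches the
// one recorded at ADD time. A late or duplicate DEL that arrives after the
// pod has been re-ADDed carries the old CONTAINER_ID and becomes a no-op,
// so it can no longer wipe the new pod's srcIP/netNS.
func idempotentDel(args *skel.CmdArgs, podName string, s statusStore) error {
	recorded, err := s.RecordedContainerID(podName)
	if err != nil {
		return err
	}
	if recorded != "" && recorded != args.ContainerID {
		return nil // stale DEL for a previous sandbox; leave state alone
	}
	return s.ClearStatus(podName)
}

func main() {}
```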
@Cerebus Yes, I was also working on handling multiple deletes. This is common K8s behavior but was not handled in meshnet. Pull #74 looks good to me. I will stop on my part and wait for pull #74 to come in. We still need to handle interleaving of ADD and DEL: if a DEL comes while an ADD is in motion, meshnet will be in trouble. I will work on this one.
Wouldn't kubelet serialize the operations in that case?
That's very reasonable and that's how it should be. But I'm not able to find definitive documentation on it yet.
I still have this issue even with the fix for #74: certain veth pairs are not being formed. I still have not seen this issue with Vxlan binds. What I notice is that there is this error and then a delete of one of the pods (dut-c); then all other pods are missing their links to this dut-c pod:

```
time="2023-04-26T21:33:42-04:00" level=info msg="Creating Veth struct with NetNS:/proc/19196/ns/net and intfName: eth5, IP:"
```

Then there is a DEL for dut-c:

```
time="2023-04-26T21:33:43-04:00" level=info msg="Processing DEL request: dut-c"
```
The portion of the log given here shows all the 'Del's to be duplicates. This looked a little odd. Maybe the log is incomplete. I can try this at my end, using a single-node cluster with multiple interconnected pods in it (so that all links are veth pairs) - will that be close to your setup?
That is right, I have a lot of connections and 8 pods interconnected. The return I mean is in meshnet.go.
After adding some extra logging to meshnet.go for veth creation in the cmdAdd function, I can see there is a possibility of a race condition.
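A sketch of one way the ADD/DEL overlap could be serialized, under the assumption that the lock lives in a long-running component such as meshnetd (each CNI invocation of the plugin binary is a separate process, so an in-plugin lock would not help); the lock map and function names are illustrative, not meshnet's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

var (
	mu       sync.Mutex
	podLocks = map[string]*sync.Mutex{}
)

// lockFor returns the mutex dedicated to a single pod name, creating it on
// first use.
func lockFor(pod string) *sync.Mutex {
	mu.Lock()
	defer mu.Unlock()
	l, ok := podLocks[pod]
	if !ok {
		l = &sync.Mutex{}
		podLocks[pod] = l
	}
	return l
}

// withPodLock runs fn while holding that pod's lock, so veth setup for an
// ADD and teardown for a DEL of the same pod cannot interleave.
func withPodLock(pod string, fn func() error) error {
	l := lockFor(pod)
	l.Lock()
	defer l.Unlock()
	return fn()
}

func main() {
	_ = withPodLock("dut-c", func() error {
		fmt.Println("creating veth pairs for dut-c")
		return nil
	})
}
```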
@networkop @kingshukdev @Cerebus |
I need help debugging this, b/c I really don't know what's going on.
When I create pods connected with meshnet, sometimes meshnet doesn't add the declared interfaces. This can happen at deployment or redeployment of a static pod. I set up my static pods with init containers that wait on the expected interfaces to be added, so it's easy to see when this happens--the Pod gets stuck in Init:0.
What I need to be able to do is reliably delete and restore static Pods from a topology. I do this by deleting the Pod, and /most/ of the time this works flawlessly--but sometimes it doesn't. It's difficult to reproduce this reliably. When it happens, I can sometimes recover by deleting the entire topology (all pods and topos) and re-deploying it. Sometimes the condition persists and I have to rollout restart meshnet. And sometimes I have to reboot the node or cluster to recover.
Looking at the meshnet log, when this happens meshnet thinks the other pod isn't up. As an example, at startup lf1 and lf5 link fine:
Then I delete pod lf1, wait a bit, then re-deploy lf1 from the same resource declaration. Then I get:
meshnetd's log looks normal:
I've tried /also/ deleting the lf1 and/or lf5 topo, but that doesn't seem to have an effect.
If I recover and redeploy and eventually recreate this condition, the affected pods or interfaces will change even with the same topology and pod declarations.
Am I bumping up against some resource limit somewhere?