Failure to AddWireRemote in large topology in multi-node cluster using grpc-wire #55
After using the container with the debug fixes, I am seeing most of the pods come up (306/337); only 31 are still stuck in an Init state. I am seeing the following potential issues. Many of these messages appear across all workers, e.g. meshnet-nq2n6 (worker-2):
coming from here: meshnet-cni/daemon/grpcwire/grpcwire.go Line 254 in d3ae648
I am also seeing errors like the following; reading through the source code, maybe this should just be logged at info instead of error?
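For illustration, a minimal sketch of the kind of severity downgrade being suggested, assuming a logrus-style logger; the message text and variable names are hypothetical and do not reproduce the actual grpcwire.go call site:

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	// Hypothetical pod/interface names, for illustration only.
	pod, intf := "bx04rno10", "eth36"

	// Before: logged at Error level, which looks alarming for a condition
	// that can occur during normal pod teardown or ordering races.
	log.Errorf("could not find grpc-wire for pod %s interface %s", pod, intf)

	// After: downgraded to Info, since the condition is expected and not actionable.
	log.Infof("could not find grpc-wire for pod %s interface %s", pod, intf)
}
```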
Seeing a single occurrence of this failure: meshnet-njd5r (worker-1):
coming from this chain:
Another thing I observed is two reboots of meshnet pods, one due to OOM and the other due to an Error status. My concern is that the "missing" interfaces were stored in local memory and did not persist across the reboot: meshnet-cni/daemon/grpcwire/grpcwire.go Line 143 in d3ae648
That's good news. I can merge the PR now, or wait until you add more to it; let me know.
agreed
this looks weird. is there a chance a pod got deleted while the CNI plugin was still processing it?
You're most likely right. It looks like if meshnetd is restarted it would have no prior knowledge of existing grpc wires. Should this somehow be persisted on disk or in the api-server? I guess the pcap.Handles still need to be properly cleaned up and re-opened on pod restart.
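To make the concern concrete, here is a minimal sketch of one way the grpc-wire bookkeeping could be persisted so a restarted meshnetd can rediscover existing wires. All names here (WireRecord, WireStore, the JSON file path) are hypothetical and are not the actual meshnet-cni types; the pcap handles themselves cannot be serialized and would still have to be re-opened after a restart:

```go
package main

import (
	"encoding/json"
	"os"
	"sync"
)

// WireRecord holds only the serializable bookkeeping for one wire end;
// the live pcap.Handle is intentionally excluded and must be re-created
// from this record after a daemon restart.
type WireRecord struct {
	LocalPod   string `json:"localPod"`
	LocalIntf  string `json:"localIntf"`
	PeerNode   string `json:"peerNode"`
	PeerIntfID int64  `json:"peerIntfId"`
}

// WireStore keeps the in-memory map and mirrors every change to disk.
type WireStore struct {
	mu    sync.Mutex
	path  string
	wires map[string]WireRecord
}

// Add records a wire and flushes the whole map so it survives an OOM kill
// or crash-loop restart of the daemon pod.
func (s *WireStore) Add(key string, w WireRecord) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.wires[key] = w
	data, err := json.MarshalIndent(s.wires, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(s.path, data, 0o644)
}

// Load restores the map on startup; an absent file just means a fresh node.
func (s *WireStore) Load() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	data, err := os.ReadFile(s.path)
	if os.IsNotExist(err) {
		s.wires = map[string]WireRecord{}
		return nil
	}
	if err != nil {
		return err
	}
	return json.Unmarshal(data, &s.wires)
}

func main() {
	s := &WireStore{path: "/var/lib/meshnet/wires.json", wires: map[string]WireRecord{}}
	_ = s.Load()
	_ = s.Add("bx03sfo03/eth1", WireRecord{LocalPod: "bx03sfo03", LocalIntf: "eth1", PeerNode: "worker-2", PeerIntfID: 42})
}
```

Persisting to the api-server instead (for example via the topology CRD, as mentioned below) would avoid node-local files, but the idea is the same: the restarted daemon re-reads the records and re-opens the pcap handles.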
I added one more change to downgrade the log from error to info. I think this should be merged as is. I spoke with Keysight and they have a branch; hopefully they can get that merged into this repo. I will open a separate issue as an FR to add proper reconciliation for grpc-wires (whether that be through the topology CRD or one of the methods you mention).
@kingshukdev for visibility
Closing in favor of reconciliation FR: #57
I am having some issues when using meshnet with grpc-wire.
I have a >300-node topology (running in KNE) spread across a 5-node k8s cluster (4 workers).
Many of the router pods get stuck in the Init:0/1 state and some of the meshnet pods get stuck in a crash loop. I see that the failed pod:eth (bx04r:eth36) is reverse-skipped by the peer pod on that link (bx03s:eth1):
Reverse-skipping of pod bx04rno10 by pod bx03sfo03
excerpt from KNE topology:
However, I do not see an entry in the ip link table for bx03s* even though bx03 is stuck in Init state on a worker with a healthy meshnet pod (see the listing sketch at the end of this description). On another note, even the running pods report strange errors (which seem like simple no-op bugs upon inspection):
On the worker nodes with the failed meshnet pods I see many koko links waiting to be renamed:
Any help in debugging this further would be appreciated.
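As a quick way to confirm whether the bx03s* veth ends were ever created on a worker, here is a small diagnostic sketch (not part of meshnet-cni) using the vishvananda/netlink package; the prefix is a hypothetical example taken from the topology names above:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/vishvananda/netlink"
)

func main() {
	// Hypothetical prefix taken from the failing link in the topology above.
	const prefix = "bx03s"

	// List every interface on the node and print the ones matching the prefix;
	// an empty result means the veth end was never created (or was deleted).
	links, err := netlink.LinkList()
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		if strings.HasPrefix(l.Attrs().Name, prefix) {
			fmt.Printf("%-16s type=%-8s state=%s\n",
				l.Attrs().Name, l.Type(), l.Attrs().OperState)
		}
	}
}
```

Running this (or the equivalent `ip link show | grep bx03s`) on each worker would also show whether the koko-created ends still exist under temporary names, which would line up with the stuck-rename observation above.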