Add reconciliation for grpc-wires #57
Testing the grpc-wire reconciliation using a 150-node topology on KNE across 2 workers. Getting a few issues:
@alexmasi:
Thanks Kingshuk, that's a mistake on my end. The grpc-wire reconciliation appears to be working now. When I delete a meshnet pod, it reloads with full information about the already-created links. I appreciate the implementation! However, there is a separate issue with meshnet reconciliation in general. I tried deleting/recreating a pod during topology creation, and it mostly works, except that parts of the topology (in this case, the init containers for several of the router pods) get stuck waiting:
Note that all but one of these cases happens on the worker node where the meshnet pod was deleted mid topology creation. Did you come across this issue in your testing, @kingshukdev?
@alexmasi glad to know that recon worked. I can think of a few tricky situations if the meshnet daemon is restarted during topology creation. It is very time sensitive: the meshnet daemon is not available, but K8s is trying to create the next pod. If the meshnet daemon comes back up before K8s retries, then it will go through. How are you restarting the meshnet daemon - is it "kill -9 pid"? Once we know how you are restarting it, we can try playing with that.
then K8s will automatically bring up a new pod to match the intent
There seems to be a bug in #80. In my single-node cluster, I get:
This happens for all pods after the first two or three.
(Pods stick in Init because I have an initContainer that waits for all the interfaces to be added. Since the CNI client is failing, this initContainer never exits.) ETA: In this deployment, pods a/b/c/d are linked in a "diamond" network (peers are a-b, b-d, a-c, c-d, and b-c), and pods aa/dd are linked to only one peer. So I speculate that this has something to do with pods that have multiple peers. More testing needed.
When a grpc-wire-enabled meshnet pod on a node restarts (due to OOM, Error, etc.), the grpc-wire state (the wire/handler maps) is neither persisted nor reconciled on restart.
meshnet-cni/daemon/grpcwire/grpcwire.go, line 143 (at commit d3ae648)
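For context, here is a minimal sketch of the kind of in-memory, per-node wire state involved. The names (`GWire`, `wireMap`) are illustrative stand-ins, not the actual grpcwire.go code; the point is that this state lives only in process memory, so a daemon restart wipes it:

```go
package grpcwire

import "sync"

// GWire is an illustrative stand-in for the per-wire state the daemon
// tracks: the local veth end, the peer node, and whether the wire is live.
type GWire struct {
	WireID   int64
	LocalIfc string
	PeerNode string
	IsActive bool
}

// wireMap keys that state by wire ID. Because it exists only in process
// memory, every entry is lost when the meshnet pod restarts.
var (
	mu      sync.RWMutex
	wireMap = map[int64]*GWire{}
)
```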
This leads to errors like the following:
```
SendToOnce (wire id - 77): Could not find local handle. err:interface 77 is not active
```
stemming from:
meshnet-cni/daemon/grpcwire/grpcwire.go, line 254 (at commit d3ae648)
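Continuing the illustrative sketch above (it additionally needs the "fmt" import), the failing path is essentially a lookup against the now-empty map, so after a restart every wire ID resolves to "not active":

```go
// getWire mirrors the failing lookup: once a restart has wiped wireMap,
// every wire ID falls through to the "is not active" error from the logs.
func getWire(id int64) (*GWire, error) {
	mu.RLock()
	defer mu.RUnlock()
	w, ok := wireMap[id]
	if !ok || !w.IsActive {
		return nil, fmt.Errorf("interface %d is not active", id)
	}
	return w, nil
}
```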
To make the grpc-wire add-on more resilient, reconciliation should be added (likely using the Topology CRD).
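A rough sketch of what startup reconciliation could look like, continuing the illustrative types above (and additionally importing "context"). `Link` and `TopologyLister` are hypothetical names, not the actual meshnet API: the idea is that on boot the daemon lists the Topology custom resources, filters the links that terminate on its node, and repopulates the wire map before serving traffic.

```go
// Link and TopologyLister are hypothetical; a real implementation would
// use the generated clientset for the meshnet Topology custom resource.
type Link struct {
	UID      int64
	LocalIfc string
	PeerNode string
}

type TopologyLister interface {
	// ListLinks returns the grpc-wire links whose local end is on node.
	ListLinks(ctx context.Context, node string) ([]Link, error)
}

// reconcileWires rebuilds wireMap on daemon startup so that a restarted
// meshnet pod keeps serving wires created before the restart.
func reconcileWires(ctx context.Context, lister TopologyLister, node string) error {
	links, err := lister.ListLinks(ctx, node)
	if err != nil {
		return fmt.Errorf("listing topology links for node %s: %w", node, err)
	}
	mu.Lock()
	defer mu.Unlock()
	for _, l := range links {
		// Re-register the wire; re-attaching the packet handler to the
		// existing veth interface would also happen here.
		wireMap[l.UID] = &GWire{
			WireID:   l.UID,
			LocalIfc: l.LocalIfc,
			PeerNode: l.PeerNode,
			IsActive: true,
		}
	}
	return nil
}
```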