Add reconciliation for grpc-wires #57
Testing the grpc-wire reconciliation using a 150-node topology on KNE across 2 workers. Getting a few issues:
@alexmasi:
Thanks Kingshuk, that's a mistake on my end. The grpc-wire reconciliation appears to be working now. When I delete a meshnet pod, it reloads with full information about the already-created links. I appreciate the implementation! However, there is a separate issue with meshnet reconciliation in general. I tried deleting/recreating a pod during topology creation, and it mostly works, except that parts of the topology (in this case, the init containers for several of the router pods) get stuck waiting:
Note that all but one of these cases happens on the worker node where the meshnet pod was deleted mid topology creation. Did you come across this issue in your testing, @kingshukdev?
@alexmasi glad to know that recon worked. I can think of a few tricky situations if the meshnet daemon is restarted during topology creation. It is very time sensitive: the meshnet daemon is not available, but K8s is trying to create the next pod. If the meshnet daemon comes back up before K8s retries, then it will go through. How are you restarting the meshnet daemon - is it "kill -9 pid"? Once we know how you are restarting it, we can try playing with that.
then K8s will automatically bring up a new pod to match the intent
There seems to be a bug in #80. In my single-node cluster, I get:
This happens for all pods after the first two or three.
(Pods stick in Init because I have an initContainer that waits for all the interfaces to be added. Since the CNI client is failing, this initContainer never exits.) ETA: In this deployment, pods a/b/c/d are linked in a "diamond" network (peers are a-b, b-d, a-c, c-d, and b-c), and pods aa/dd are linked to only one peer. So I speculate that this has something to do with pods that have multiple peers. More testing needed.
When a grpc-wire-enabled meshnet pod on a node restarts (due to OOM, Error, etc.), the grpc-wire state (the wire/handler maps) is neither persisted nor reconciled on restart.
meshnet-cni/daemon/grpcwire/grpcwire.go, line 143 (at commit d3ae648)
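For context, here is a minimal sketch of the kind of in-memory, per-node wire state involved. The names (`GWire`, `wireMap`) are illustrative stand-ins, not the actual grpcwire.go code; the point is that this state lives only in process memory, so a daemon restart wipes it:

```go
package grpcwire

import "sync"

// GWire is an illustrative stand-in for the per-wire state the daemon
// tracks: the local veth end, the peer node, and whether the wire is live.
type GWire struct {
	WireID   int64
	LocalIfc string
	PeerNode string
	IsActive bool
}

// wireMap keys that state by wire ID. Because it exists only in process
// memory, every entry is lost when the meshnet pod restarts.
var (
	mu      sync.RWMutex
	wireMap = map[int64]*GWire{}
)
```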
This leads to errors like the following:
```
SendToOnce (wire id - 77): Could not find local handle. err:interface 77 is not active
```
stemming from:
meshnet-cni/daemon/grpcwire/grpcwire.go, line 254 (at commit d3ae648)
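Continuing the illustrative sketch above (it additionally needs the "fmt" import), the failing path is essentially a lookup against the now-empty map, so after a restart every wire ID resolves to "not active":

```go
// getWire mirrors the failing lookup: once a restart has wiped wireMap,
// every wire ID falls through to the "is not active" error from the logs.
func getWire(id int64) (*GWire, error) {
	mu.RLock()
	defer mu.RUnlock()
	w, ok := wireMap[id]
	if !ok || !w.IsActive {
		return nil, fmt.Errorf("interface %d is not active", id)
	}
	return w, nil
}
```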
To make the grpc-wire add-on more resilient, reconciliation should be added (likely using the Topology CRD).
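A rough sketch of what startup reconciliation could look like, continuing the illustrative types above (and additionally importing "context"). `Link` and `TopologyLister` are hypothetical names, not the actual meshnet API: the idea is that on boot the daemon lists the Topology custom resources, filters the links that terminate on its node, and repopulates the wire map before serving traffic.

```go
// Link and TopologyLister are hypothetical; a real implementation would
// use the generated clientset for the meshnet Topology custom resource.
type Link struct {
	UID      int64
	LocalIfc string
	PeerNode string
}

type TopologyLister interface {
	// ListLinks returns the grpc-wire links whose local end is on node.
	ListLinks(ctx context.Context, node string) ([]Link, error)
}

// reconcileWires rebuilds wireMap on daemon startup so that a restarted
// meshnet pod keeps serving wires created before the restart.
func reconcileWires(ctx context.Context, lister TopologyLister, node string) error {
	links, err := lister.ListLinks(ctx, node)
	if err != nil {
		return fmt.Errorf("listing topology links for node %s: %w", node, err)
	}
	mu.Lock()
	defer mu.Unlock()
	for _, l := range links {
		// Re-register the wire; re-attaching the packet handler to the
		// existing veth interface would also happen here.
		wireMap[l.UID] = &GWire{
			WireID:   l.UID,
			LocalIfc: l.LocalIfc,
			PeerNode: l.PeerNode,
			IsActive: true,
		}
	}
	return nil
}
```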