Failure to AddWireRemote in large topology in multi-node cluster using grpc-wire #55
After using the container with the debug fixes, I am seeing most of the pods come up (306/337); only 31 are still stuck in an Init state. I am seeing the following potential issues. Many of these messages appear across all workers, e.g. meshnet-nq2n6 (worker-2):
coming from here: meshnet-cni/daemon/grpcwire/grpcwire.go Line 254 in d3ae648
I am also seeing errors like the following; reading through the source code, maybe this should just be logged at info instead of error?
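For illustration, a minimal sketch of the kind of severity downgrade being suggested, assuming a logrus-style logger; the message text and variable names are hypothetical and do not reproduce the actual grpcwire.go call site:

```go
package main

import log "github.com/sirupsen/logrus"

func main() {
	// Hypothetical pod/interface names, for illustration only.
	pod, intf := "bx04rno10", "eth36"

	// Before: logged at Error level, which looks alarming for a condition
	// that can occur during normal pod teardown or ordering races.
	log.Errorf("could not find grpc-wire for pod %s interface %s", pod, intf)

	// After: downgraded to Info, since the condition is expected and not actionable.
	log.Infof("could not find grpc-wire for pod %s interface %s", pod, intf)
}
```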
Seeing a single occurrence of this failure: meshnet-njd5r (worker-1):
coming from this chain:
Another thing I observed is two reboots of meshnet pods, one due to OOM and the other due to an Error status. My concern is that the "missing" interfaces were stored in local memory and did not persist across the reboot: meshnet-cni/daemon/grpcwire/grpcwire.go Line 143 in d3ae648
That's good news. I can merge the PR now, or wait until you add more to it; let me know.
agreed
this looks weird. is there a chance a pod got deleted while the CNI plugin was still processing it?
You're most likely right. It looks like if meshnetd is restarted it would have no prior knowledge of existing grpc wires. Should this somehow be persisted on disk or in the api-server? I guess the pcap.Handles still need to be properly cleaned up and re-opened on pod restart.
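To make the concern concrete, here is a minimal sketch of one way the grpc-wire bookkeeping could be persisted so a restarted meshnetd can rediscover existing wires. All names here (WireRecord, WireStore, the JSON file path) are hypothetical and are not the actual meshnet-cni types; the pcap handles themselves cannot be serialized and would still have to be re-opened after a restart:

```go
package main

import (
	"encoding/json"
	"os"
	"sync"
)

// WireRecord holds only the serializable bookkeeping for one wire end;
// the live pcap.Handle is intentionally excluded and must be re-created
// from this record after a daemon restart.
type WireRecord struct {
	LocalPod   string `json:"localPod"`
	LocalIntf  string `json:"localIntf"`
	PeerNode   string `json:"peerNode"`
	PeerIntfID int64  `json:"peerIntfId"`
}

// WireStore keeps the in-memory map and mirrors every change to disk.
type WireStore struct {
	mu    sync.Mutex
	path  string
	wires map[string]WireRecord
}

// Add records a wire and flushes the whole map so it survives an OOM kill
// or crash-loop restart of the daemon pod.
func (s *WireStore) Add(key string, w WireRecord) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.wires[key] = w
	data, err := json.MarshalIndent(s.wires, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(s.path, data, 0o644)
}

// Load restores the map on startup; an absent file just means a fresh node.
func (s *WireStore) Load() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	data, err := os.ReadFile(s.path)
	if os.IsNotExist(err) {
		s.wires = map[string]WireRecord{}
		return nil
	}
	if err != nil {
		return err
	}
	return json.Unmarshal(data, &s.wires)
}

func main() {
	s := &WireStore{path: "/var/lib/meshnet/wires.json", wires: map[string]WireRecord{}}
	_ = s.Load()
	_ = s.Add("bx03sfo03/eth1", WireRecord{LocalPod: "bx03sfo03", LocalIntf: "eth1", PeerNode: "worker-2", PeerIntfID: 42})
}
```

Persisting to the api-server instead (for example via the topology CRD, as mentioned below) would avoid node-local files, but the idea is the same: the restarted daemon re-reads the records and re-opens the pcap handles.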
I added one more change to downgrade the log from error to info. I think this should be merged as is. I spoke with Keysight and they have a branch; hopefully they can get that merged into this repo. I will open a separate issue as an FR to add proper reconciliation for grpc-wires (whether that be through the topology CRD or one of the methods you mention).
@kingshukdev for visibility
Closing in favor of reconciliation FR: #57
I am having some issues when using meshnet with grpc-wire.
I have a >300-node topology (running in KNE) spread across a 5-node k8s cluster (4 workers).
Many of the router pods get stuck in the Init:0/1 state and some of the meshnet pods get stuck in a crash loop. I see that the failed pod:eth (bx04r:eth36) is reverse-skipped by the peer pod on that link (bx03s:eth1):
Reverse-skipping of pod bx04rno10 by pod bx03sfo03
excerpt from KNE topology:
However, I do not see an entry in the ip link table for bx03s* even though bx03 is stuck in Init state on a worker with a healthy meshnet pod (see the listing sketch at the end of this description). On another note, even the running pods report strange errors (which seem like simple no-op bugs upon inspection):
On the worker nodes with the failed meshnet pods I see many koko links waiting to be renamed:
Any help in debugging this further would be appreciated.
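As a quick way to confirm whether the bx03s* veth ends were ever created on a worker, here is a small diagnostic sketch (not part of meshnet-cni) using the vishvananda/netlink package; the prefix is a hypothetical example taken from the topology names above:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/vishvananda/netlink"
)

func main() {
	// Hypothetical prefix taken from the failing link in the topology above.
	const prefix = "bx03s"

	// List every interface on the node and print the ones matching the prefix;
	// an empty result means the veth end was never created (or was deleted).
	links, err := netlink.LinkList()
	if err != nil {
		panic(err)
	}
	for _, l := range links {
		if strings.HasPrefix(l.Attrs().Name, prefix) {
			fmt.Printf("%-16s type=%-8s state=%s\n",
				l.Attrs().Name, l.Type(), l.Attrs().OperState)
		}
	}
}
```

Running this (or the equivalent `ip link show | grep bx03s`) on each worker would also show whether the koko-created ends still exist under temporary names, which would line up with the stuck-rename observation above.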