Replies: 2 comments
-
same boat here on rke2
-
How did you configure your cluster? With regard to the failures you're seeing when the coredns pod is not on the same node as the test pod: this is usually caused by something dropping the vxlan packets that carry pod-to-pod and pod-to-service traffic between nodes. Confirm that the vxlan port is open between all the nodes in your cluster, and that you're not affected by any of the common known issues.
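A minimal sketch of that host-side check, assuming RKE2's default Canal CNI (vxlan over UDP 8472) and a hypothetical peer node IP:

```shell
# Check whether a host firewall is active and what it allows
# (CentOS ships firewalld by default).
sudo firewall-cmd --state && sudo firewall-cmd --list-all

# Canal's vxlan backend uses UDP 8472 by default; probe it from another node.
# 10.0.0.11 is a hypothetical peer node IP -- substitute your own.
# Note: a UDP probe with nc is not conclusive, since no reply does not
# necessarily mean the port is closed.
nc -z -u -w 2 10.0.0.11 8472 && echo "no ICMP port-unreachable seen"

# Look for rules that mention the vxlan port.
sudo iptables -L -n -v | grep 8472
```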
-
We're currently testing several RKE2 clusters on top of CentOS 8 Stream, each with 3 manager nodes and 3 worker nodes. After applying the required CentOS 8 fixes, the clusters work well except for CoreDNS: whenever a pod looks up another pod's cluster IP address in CoreDNS, CoreDNS responds with host not found (NXDOMAIN).
We've gone through https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/ to find the problem, but the logs show DNS requests coming in, and DNS responses apparently arrive at the requesting pods (i.e. the applications that do the DNS lookup fail instantly, not after a time-out).
What can we do to resolve this issue?
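For reference, the checks from that debugging guide, roughly as we ran them (the dnsutils pod name comes from the guide's own example manifest):

```shell
# Deploy the dnsutils test pod from the Kubernetes docs.
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml

# Try resolving an in-cluster name and an external name.
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
kubectl exec -i -t dnsutils -- nslookup git.rancher.io

# Check the pod's resolver configuration.
kubectl exec dnsutils -- cat /etc/resolv.conf

# Watch CoreDNS logs while the lookups run.
kubectl logs -n kube-system -l k8s-app=kube-dns
```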
Some additional information:
So CoreDNS responds correctly to everything except hostnames that exist within the Kubernetes cluster ("No answer").
Doing three nslookups like the above on git.rancher.io, kubernetes, and kubernetes.default results in the following wireshark output:
UPDATE: this problem stopped occurring as soon as I restarted dnsutils; it got scheduled on a different node and started working... However, on another cluster, where dnsutils was already scheduled on a different node, the problem remained. It appears that the problem occurs only when the DNS client is scheduled on a specific node, but that specific node could be any node...
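One way to narrow down which node is at fault is to query each CoreDNS pod directly by pod IP, bypassing the cluster-DNS service VIP (the pod IP below is hypothetical):

```shell
# List CoreDNS pods with their node placement and pod IPs.
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide

# Query a specific CoreDNS pod IP from the test pod (10.42.1.7 is hypothetical).
# If lookups succeed against the CoreDNS pod on the same node but fail against
# pods on other nodes, the problem is the cross-node path (vxlan), not CoreDNS.
kubectl exec -i -t dnsutils -- nslookup kubernetes.default 10.42.1.7
```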