-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
service-mirror: support gateway resolving to multiple addresses #11499
Conversation
35960c2
to
b6c2274
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @MrFreezeex, at a glance this looks good. An issue would have been nice for us to have a couple of examples and to understand the problem better. I suppose the issue is that an endpoint might go down and you get downtime while the service mirror is resolving the address again?
As far as the code is concerned, it looks good (modulo the comment I left on the sort).
Thanks for the review! Sorry for not creating the issue prior to this, so we did have a small downtime probably because of that but I am not 100% sure of it. The gist of it is that if you give an address to the gateway that resolve to multiple IPs without this code change it mirrored only one IP. In our case we have an AWS load balancer that resolves to 3 IPs for instance and dumping the mirrored kubernetes endpoints only show one of those IP. |
If the gateway address was a name resolving to multiple addresses, previously service-mirror took only one of those address. This behavior could led to downtime in case the gateway behind the IP picked by service-mirror is down. This commit fix this behavior by considering all the IPs resolved. Signed-off-by: Arthur Outhenin-Chalandre <[email protected]>
Now that we consider all the IPs resolved by the DNS, we make sure they are sorted lexicographically so that resolveGatewayAddress answer a stable output. Signed-off-by: Arthur Outhenin-Chalandre <[email protected]>
b6c2274
to
8921295
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MrFreezeex the change looks great! I tested it out manually by creating a headless service with a manual EndpointSlice to create DNS records (much easier than hacking CoreDNS).
:; k exec dnsutil-pod -- nslookup headless-test.default.svc.cluster.local
Server: 10.43.0.10
Address: 10.43.0.10#53
Name: headless-test.default.svc.cluster.local
Address: 192.168.224.4
Name: headless-test.default.svc.cluster.local
Address: 192.168.224.5
If we pass --gateway-addresses=headless-test.default.svc.cluster.local
to the Link
command we can see the IPs being resolved in the probe service's endpoints
[
{
"addresses": [
"192.168.224.4"
],
"conditions": {
"ready": true
}
},
{
"addresses": [
"192.168.224.5"
],
"conditions": {
"ready": true
}
}
]
I considered asking for an integration test. It'd be nice to have but also painful to wire up without over abstracting this particular piece of code to take in a DNS resolver. All in all, I'm confident with the manual test. If you feel like contributing and know how to solve this easily in an integration test we'd appreciate a follow-up! :)
Thanks again 🚢
Thanks @mateiidavid 🙏. I was wondering do you think this patch could be considered for backport to the 2.14 branch? Happy to submit a PR with the cherry-pick if yes! :D |
@MrFreezeex we generally consider only bug fixes for backports. New features are expected to release in new stable versions. We could probably argue that this is a bug fix though :). With that being said, we should let the change simmer for a bit in an edge release (or two). We almost exclusively release all changes to an edge version first to ensure they don't break anyone's production environment. If an edge version considered as a release candidate doesn't raise any major flags, we'll cherrypick fixes into point releases. I know you're probably eager to see this shipped. When it makes it into an edge (should happen this week) you could only upgrade the multicluster extension and leave the rest of the control plane on stable. We'll consider backporting this into a stable point release; if you can only stay on stable then it might take a bit more time until you can rollout the new change. The sooner we get feedback around changes from our user base the quicker fixes can be released usually. I hope this all makes sense, and thank you for your patience and for your contribution! I really appreciate all of the effort you've put in. Please feel free to reply here, or in a discussion, or anywhere else if you want to continue the conversation. |
No worries and thanks for the kind words. Please keep it on your radar for 2.14.4 if there was enough edge releases in the meantime 🙏.
Oh indeed thanks that's a great idea, I will do that! |
## edge-23.11.2 This edge release contains observability improvements and bug fixes to the Destination controller, and a refinement to the multicluster gateway resolution logic. * Fixed an issue where the Destination controller could stop processing service profile updates, if a proxy subscribed to those updates stops reading them; this is a followup to the issue [#11491] fixed in edge-23.10.3 ([#11546]) * In the Destination controller, added informer lag histogram metrics to track whenever the objects tracked are falling behind the state in the kube-apiserver ([#11534]) * In the multicluster service mirror, extended the target gateway resolution logic to take into account all the possible IPs a hostname might resolve to, not just the first one (thanks @MrFreezeex!) ([#11499]) * Added probes to the debug container to appease environments requiring probes for all containers ([#11308])
## edge-23.11.2 This edge release contains observability improvements and bug fixes to the Destination controller, and a refinement to the multicluster gateway resolution logic. * Fixed an issue where the Destination controller could stop processing service profile updates, if a proxy subscribed to those updates stops reading them; this is a followup to the issue [#11491] fixed in [edge-23.10.3] ([#11546]) * In the Destination controller, added informer lag histogram metrics to track whenever the Kubernetes objects watched by the controller are falling behind the state in the kube-apiserver ([#11534]) * In the multicluster service mirror, extended the target gateway resolution logic to take into account all the possible IPs a hostname might resolve to, rather than just the first one (thanks @MrFreezeex!) ([#11499]) * Added probes to the debug container to appease environments requiring probes for all containers ([#11308]) [edge-23.10.3]: https://github.com/linkerd/linkerd2/releases/tag/edge-23.10.3 [#11546]: #11546 [#11534]: #11534 [#11499]: #11499 [#11308]: #11308
Hi @mateiidavid, I saw that there was two new release of linkerd since then and it's been a couple of weeks that the edge release is out. Would it be possible to include this patch in the next one 🙏 ? |
Hey @MrFreezeex! Wanted to let you know that I did see your message even though I didn't reply immediately. I raised it with the team :) I think including it makes sense. |
If the gateway address was a name resolving to multiple addresses, previously service-mirror took only one of those address. This behavior could led to downtime in case the gateway behind the IP picked by service-mirror is down. This commit fix this behavior by considering all the IPs resolved. Now that we consider all the IPs resolved by the DNS, we make sure they are sorted lexicographically so that resolveGatewayAddress answer a stable output. Signed-off-by: Arthur Outhenin-Chalandre <[email protected]>
* 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287) * 284d76b service-mirror: support gateway resolving to multiple addresses (#11499) * 64bccd9 Improve klog (client-go logs) handling (#11632) * 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644) * e294c78 Bump prometheus to v2.48.0 (#11633) * b24b0e97c stable-2.14.6
* 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287) * 284d76b service-mirror: support gateway resolving to multiple addresses (#11499) * 64bccd9 Improve klog (client-go logs) handling (#11632) * 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644) * e294c78 Bump prometheus to v2.48.0 (#11633) * b24b0e97c stable-2.14.6
stable-2.14.6 * 0a72f1f Add imagePullSecrets to the multicluster chart. (#11287) * 284d76b service-mirror: support gateway resolving to multiple addresses (#11499) * 64bccd9 Improve klog (client-go logs) handling (#11632) * 6a07e2c Add ability to configure client-go's `QPS` and `Burst` settings (#11644) * e294c78 Bump prometheus to v2.48.0 (#11633)
If the gateway address was a name resolving to multiple addresses, previously service-mirror took only one of those address. This behavior could led to downtime in case the gateway behind the IP picked by service-mirror is down. This commit fix this behavior by considering all the IPs resolved.