-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Determine if RBAC updates should drain #311
Comments
Definitely feels like it should drain IMO |
this also impact waypoints. If a long lived connection exists, and a waypoint is deployed, ztunnel will not direct traffic through the waypoint until the original connection is closed and re-established. steps to reproduce using online boutique microservice:
|
I don't understand what "drain" here means. There's no general drain strategy for TCP connections. You can only abruptly terminate them. Is that what's proposed? @GregHanson That's a good point in favor of abrupt termination since authorization depends on Waypoints for HTTP rules, so it'd be incorrect not to route to Waypoint indefinitely. |
Yes, drain was poorly worded. Although fwiw I'm not proposing it, just that
we make an explicit decision - I am leaning towards not terminating
…On Fri, Jun 9, 2023, 3:43 PM Kuat ***@***.***> wrote:
I don't understand what "drain" here means. There's no general drain
strategy for TCP connections. You can only abruptly terminate them. Is that
what's proposed?
—
Reply to this email directly, view it on GitHub
<#311 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEYGXJ2HZOVJ2N24BN24WTXKORIFANCNFSM6AAAAAATRCHGVI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
I think never terminating would be dangerous in an eventually consistent system. When workloads and policies are deployed concurrently (enabling waypoint is a policy), the order should not matter when things open connections. So it seems necessary to terminate eventually, and the challenge is defining when to terminate connections and limiting the scope to only those should be. |
Agreed, I'd expect updated RBAC become effective even with long lived connection. cc @louiscryan for input |
I agree with the concerns of eventual consistency. For RBAC it makes more sense I think - we explicitly would close the connection if it was new. For Waypoint being added, though, closing the connection seems unneccesarily aggressive |
Dropping a connection that would previously have been allowed but no longer is, as soon as we are aware of that, regardless of its age, is an non-negotiable obligation on our end, IMO. Scoping that to "pure RBAC" so we don't have to do the same when L7 concerns are added with Waypoints is both a complex behavior to actually model, and creates behavioral exceptions in ztunnel around connection handling that make it harder to reason about - I think it costs more than it saves to parse the behavior differently here. I think we just drop in both cases, because that's the simplest, safest, and most obvious behavior. |
I don't agree its simpler nor safer. Its not simpler as we need to go implement it, which is strictly added complexity. Its not safer because it abruptly tears down users connections. I don't think most users will be happy when they deploy a waypoint and suddenly get a massive surge of 503s. Lin gave a talk about dynamically adding a waypoint to do a canary rollout - interestingly, its broken in both cases:
Both are pretty unfortunate |
I think we're over-indexing on TCP handling in ztunnel. If it was working on the packets, it would be completely natural to drop packets as soon as the routing + rbac imply so. Since ztunnel has to support UDP and L3 eventually, it is the necessary thing to do. The fact that errors are opaque is a general problem with non-application level protocols. |
AFAIK most CNIs actually do not drop packets when the netpol changes. At
least that has been my experience a few years ago - worth surveying the
landscape
…On Mon, Jun 12, 2023 at 8:43 AM Kuat ***@***.***> wrote:
I think we're over-indexing on TCP handling in ztunnel. If it was working
on the packets, it would be completely natural to drop packets as soon as
the routing + rbac imply so. Since ztunnel has to support UDP and L3
eventually, it is the necessary thing to do. The fact that errors are
opaque is a general problem with non-application level protocols.
—
Reply to this email directly, view it on GitHub
<#311 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEYGXLL55YQFECKARFAZ7LXK42K5ANCNFSM6AAAAAATRCHGVI>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Fair enough. In general, delaying the application of authZ rules once you have them is a security hole. It's unclear to me what is gained, ultimately, by choosing to carry that security hole. |
I am not sure how can we be ware of the rbac failure after a HBONE request was sent, IIUC the authz is only done once for a tcp connection |
It seems there are (potentially) 2 difference scenarios here:
I think we're also able to say that item 1 isn't actually a real thing. Existence of a waypoint doesn't imply in anyway that traffic strictly must use it. User's should express this via auth API and thus item 2 is really the only problem we're concerned with. It could potentially be a property exposed via the SuperService whether or not we manage these policies for the user. |
I definitely agree that if we change something, we should not do anything if the new policies are allowed (only if they deny, then we close). So we would not want a traditional Envoy "drain". We still need to decide if we want to apply them to existing connections. I think we should |
@ilrudie ok to assign this to you? Can we identify action item for this in terms of beta for L4 first? Seems everyone agrees that we should do something when RBAC updates, even for long lived connections. |
@linsun, Sure! Happy to work on this. |
Added a topic to the ambient WG to try to finalize consensus here before diving into implementation. |
My opinion is for L7 we can deny future request, but for L4 there is no need, and also hard because the authz is applied per connection, while it is per request for L7 |
@hzxuzhonghu To be clear: you're saying you do want to close/drain existing connections, yeah? |
Not exactly, as to L4, the rbac is enforced once after connection established. We donot have the chance to close connection. For L7, as the connection is shared, the authz is for requests, there is no reason to close the connection. |
It should be possible to do though. I think we can pass a rx channel through to our proxy functions and then instead of awaiting proxy_to we can select on proxy_to and query the channel |
High level sketch, what I'm looking at right now is having a proxy instance be able to handle it's own connections. We'll either poll for auth updates or have a channel where something like a policy update can include notifying the routine that will close these channels (undecided). Each connection will have a channel it can check which will cause it to close and those close channels will likely be keyed by rbac::Connection in a shared hashmap so the proxy has a relatively easy time of determining which connections must be closed. It's pretty WIP but hope to have some code to peak at soon. |
But each connection we do assert_rbac once now. |
You're correct. At present we only assert_rbac once. The proposal is to have a way for the zt to assert_rbac on existing connections and then close/drain them via a channel. It's net new code/functionality for zt. |
Sorry it'll be a bit tougher to read since it's built on top of zt code which hasn't merged yet but #772 has a partial implementation of what I was thinking about in terms of being able to assert policy against connections after they were established. |
I've seen the behavior of not re-authorizing long-lived connections when AuthZ policy changes confuse users wondering why their rules don't seem to be effective - this is something we had wanted to implement in Consul but seemed to be prohibitively difficult for our architecture without Envoy changes. The implementation @ilrudie is describing sounds substantially aligned to what I think the optimal user experience should be, excited to see progress on this! |
I think you need to support draining at the tunnel protocol level to make enforcement tolerable in terms of UX. What I mean is that a connection that is destined to be terminated should signal the peers that are application aware to terminate it gracefully. Ztunnel cannot do it for HTTP but if it uses a waypoint, an HTTP connection to ztunnel should not be abruptly closed by ztunnel without asking the waypoint to send a GO_AWAY frame. Otherwise, this is a classical 503 scenario and not acceptable in production. |
It seems to me that layer targeted policy would be helpful here. If you target L4 then it's explicitly not application enforced and DROP seems like suitable behavior. If you want some sort of gentle application-aware handling then target L7 layer with your policy. |
+1 and we should already get that in most cases IIUC - if a policy is applied that should terminate a L7 connection, waypoint should terminate it, and ZT is none the wiser. If you nuke connections via policy at the L4 layer, regardless of whether the connection might be carrying a higher-level protocol, there's little we can do but drop it. I guess we could set up a backchannel to the waypoint to inform it and give it time to do something, but that adds complexity and I don't see it as a beta blocker. |
Yeah, it seems plausible to make but it would likely wind up a special behavior of the reference implementation that requires Istio ztunnel and waypoint. I'd agree re: Beta and add that having a good methodology for how to try to achieve this type of behavior without needing tightly coupled components is probably a strong start. |
+1 on this as well. I've observed DROP for L4 policy for CNI behaviors as well. |
Yeah, I think it's pretty clear that you can't drain L4 RBAC. So the answer to the issue title is "no", and the choice is how often to enforce policies to drop (at start or periodically). |
Indeed. "drain" and "rbac" here are probably not that ideal... #naming-hard |
If anyone was holding off on taking a look because the diff was huge I've rebased against master now that Yuval's inpod has merged and it's much more reasonable manageable. |
Istio's L4 AuthorizationPolicy functionality essentially overlaps with the Kubernetes networkpolicy. |
istio/istio#48876 Similar issue at the beginning of Ambient deploy/open. |
Since it's implementation-defined and not part of NetPol to dictate one or the other, we can do what makes sense for us. NetPol and kubernetes do not make the same security or policy guarantees or have the same scoping as Istio AuthPolicy. NetworkPolicy is fine for basics. Istio's L4 AuthPolicy is an optional extension that composes on top of k8s NetworkPolicy, providing both more policy controls and more security guarantees. It is entirely fine for Istio's AuthPolicy impl to be more strict than either the K8S NetworkPolicy standard, or a specific implementation of the K8S NetworkPolicy standard. |
The decision to forcibly terminate existing connections has both advantages and disadvantages. It ultimately depends on what aspects the users value more, and giving them the choice to decide would likely be a better approach. Perhaps an option to enforce connection termination could be added to the L4 RBAC policy. example in my pr: istio/istio#49106 |
It seems fundamentally different to me. In your pr the intent is to allow not severing existing connections when adding a wl to ambient. One could say the intent is clear here and there are plenty of controls to manage this BUT adding to ambient isn't an explicitly "destructive" operation in the way denying connectivity is. In the case of auth policy change it would be allowing a connection the user explicitly denied to continue which seems odd since they have taken an action specifically to deny this connectivity. User intent seems way less ambiguous when they've said "deny this connection". Louis's comments about disconnecting cause from effect also hold a lot of sway here IMO. If the user made a mistake it's better they find out when they make it vs finding out at some indeterminate point in the future. We do them no favors by allowing them to accidentally make a ticking time bomb. |
Hi @ilrudie - should we close this? I recall you implemented this already? |
Consider a long lived connection. After it is established, we create a rule that would have denied it. Should the connection be terminated, or should only new connections start being terminated?
Prior art:
The text was updated successfully, but these errors were encountered: