Gossip traffic service / Istio injection #333
Comments
Hey @mjnagel, thanks for the detailed report and analysis.

First off, the behavior you describe, where some messages are temporarily missing until a hard refresh, is the most common sign that at least some of the Mattermost server clustering messages are not fully propagating. In these cases, users receive realtime messages from other users connected to the same server instance, but not from other server instances. All this is to say that your analysis is correct: something is blocking Mattermost clustering from fully working in your configuration. As Mattermost pod replicas are in the same namespace, we haven't encountered this issue ourselves, but our k8s cluster networking stack is different and we don't use Istio.

Because cluster gossip messaging is part of the "base Mattermost experience", I am leaning towards supporting this directly in the operator. @fmartingr or @mirshahriar, do either of you have additional input on this? If not, we can get this added.

@mjnagel, can you confirm that you only needed to add the new service+port and didn't need to change the Mattermost container port config?
@gabrieljackson that is correct, we just added that service on its own, with no other modifications to the deployment, etc. We've had this deployed in prod (Istio-injected with the extra service) for a few weeks now and haven't seen any adverse effects or other issues, for what it's worth 😄
Thanks for confirming! We will prioritize getting this option added into the operator.
Summary
When Istio-injecting Mattermost, I noticed inconsistent realtime chat when using more than one replica. Messages between users would not always be delivered until a browser refresh (which forces a sync from the database).
After a lot of debugging I narrowed this down to issues with gossip traffic between replicas, with Istio interrupting that traffic. I was able to resolve the issue in two different ways:

1. Annotating the Mattermost pods with `traffic.sidecar.istio.io/excludeInboundPorts: "8074"` (to exclude inbound gossip traffic from the sidecar).
2. Deploying an additional service that exposes the gossip port under the name `tcp-gossip` (to force Istio protocol selection).

The second path here is more ideal to ensure all traffic is passing through the sidecar. As a temporary fix our team is manually deploying a second service to resolve this issue; a minimal sketch follows.
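This sketch assumes the default Mattermost gossip port 8074 and an `app: mattermost` label selector; the name, namespace, and selector are placeholders and need to match your operator-managed installation:

```yaml
apiVersion: v1
kind: Service
metadata:
  # Hypothetical name/namespace; adjust to your installation.
  name: mattermost-gossip
  namespace: mattermost
spec:
  # Assumed selector; it must match the labels on the operator-managed Mattermost pods.
  selector:
    app: mattermost
  ports:
    # The "tcp-" name prefix tells Istio to treat this port as plain TCP
    # instead of relying on protocol auto-detection.
    - name: tcp-gossip
      protocol: TCP
      port: 8074
      targetPort: 8074
```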
We also confirmed that adding the port to the existing operator-managed service works (by `kubectl edit`-ing the service).

Steps to reproduce
Deploy Mattermost via the operator with more than one replica in an Istio sidecar-injected namespace, then exchange messages between users connected to different replicas.
Expected behavior
I would expect users to be able to chat in realtime with no dropped messages.
Observed behavior (that appears unintentional)
Some percentage of chat messages (around or less than 50%) are not delivered in realtime; the browser must be refreshed to see them.
Possible fixes
The solution I think would be best is to add the gossip port to the existing operator-managed service (or conditionally deploy a second service for just gossip traffic, like in my example above). I'm wondering if others have run into this, or if there are other solutions; the fix in my case was to force Istio to handle this traffic as TCP (rather than auto-detecting, which may have been classifying some traffic incorrectly). I'm sure there's a balance here in adding very deployment-specific pieces to the operator, since this gossip service/port would not be required in most scenarios.
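For illustration, the change to the operator-managed service could be as small as one additional ports entry along these lines; the existing app port shown (8065, the Mattermost default) and its name are placeholders, and the operator's actual service spec may differ:

```yaml
# Illustrative excerpt of the operator-managed Mattermost service spec.
spec:
  ports:
    - name: app          # existing app port (8065 is the Mattermost default)
      port: 8065
      targetPort: 8065
    - name: tcp-gossip   # proposed addition: gossip port with a tcp- prefixed name
      protocol: TCP      # so Istio treats it as raw TCP rather than auto-detecting
      port: 8074
      targetPort: 8074
```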
Happy to help further reproduce the issue we were seeing and work on a PR to add this functionality, but wanted to start by opening a discussion in this issue. I'm not sure if this is the best place to raise this issue, but it fit here for my use case since we deploy with the operator - happy to move this elsewhere if desired.