Provide Specs for Kubernetes Cluster #27

Open · winmillwill opened this issue Sep 1, 2020 · 4 comments

@winmillwill

I'm interested in running anna and cloudburst on a different flavor of Kubernetes, specifically GKE. From skimming the repo I see mentions of a mesh and an ELB. Beyond that, I see you are using kops rather than the AWS-managed Kubernetes offering, so I wonder whether there are specific requirements in terms of control over the Kubernetes components, the virtual machines, or the network(s) between them.

@vsreekanti
Member

Hi @winmillwill -- thanks for your question. You're right that we have been using kops instead of EKS/GKE/etc. The main reason is that when we started this work, hosted k8s services didn't exist or weren't in GA yet. 🙂

The easy part of porting to a managed k8s service is the YAML specs for each service component -- you can find these in hydro/cluster/yaml. For the most part, they are using general k8s constructs (except for some kops gunk around mounting external storage), so you should be able to deploy those relatively seamlessly.

The parts that will require work are in hydro/management. These components make autoscaling and fault-tolerance decisions by using the kops APIs to check for failed machines and increase/decrease cluster sizes. I have less of a sense of how EKS/GKE manage these tasks, but my hunch is that you would need to replace some of those components with vendor-specific APIs to interact with the metadata exposed by the specific services.

I'm happy to chat more if this is something you are going to be working on!

@winmillwill
Author

Thanks for the info and the pointers.

GKE has optional autoscaling built in -- you just tick a checkbox, and if the total requested resources across pods exceed what is available, you get another node. I don't know whether it has limitations that would require a more involved approach for some use cases, nor whether other services provide a comparable feature or what their limitations are. I would lean toward making autoscaling optional. It seems like scaling the number of nodes, automatically or otherwise, is going to be decided by whoever administers the k8s cluster, and adoption and experimentation are easier if the user doesn't have to be a cluster administrator.

From looking around in hydro/management, it seems like the only service dependencies other than the zmq communication are AWS/kops for autoscaling and the Kubernetes API for detecting membership changes and for copying config files. I think the Kubernetes API dependency can be removed. From what I can tell, the file that is copied into new pods doesn't change at runtime, so it could just be a ConfigMap that each pod mounts, or it could be built into the container image. Regarding detecting membership, we can just do a DNS query on a headless service.
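
To make the headless-service idea concrete, here is a minimal sketch; the service name, label, and port are illustrative, not taken from the repo:

```yaml
# Hypothetical headless service fronting the KVS pods.
apiVersion: v1
kind: Service
metadata:
  name: anna-memory              # illustrative name
spec:
  clusterIP: None                # headless: DNS returns one record per ready pod
  selector:
    role: memory                 # illustrative label carried by the KVS pods
  ports:
    - name: gossip
      port: 6200                 # illustrative port
```

With that in place, any pod can resolve anna-memory.<namespace>.svc.cluster.local and get back the current set of pod IPs, so a periodic DNS lookup is enough to detect membership changes without touching the Kubernetes API.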

Another issue I see is that the EBS DaemonSets expect the disks that kops attaches when provisioning the instance group. It would be more portable if we instead used a StatefulSet that took advantage of the cloud provider's facilities for provisioning a volume dynamically. Is there a known issue with that approach?

Regarding the general use of DaemonSets and hostNetwork, is the idea to avoid the additional latency of iptables and so on? I'm also curious about the use of hostIPC, whether the non-stateful DaemonSets could be Deployments instead, and which containers depend on being on the same node as some other container.

@vsreekanti
Member

100% agreed re: removing autoscaling from the purview of the user and relying on out-of-the-box autoscaling as much as possible. Like I said, those services weren't available when we started this project, so we weren't able to take advantage of them.

You're right that AWS/kops for autoscaling and membership changes are the primary use cases. The config file could be changed by the system administrator based on their preferences, so it probably shouldn't be built into the image. Mounting a config map sounds promising, but I am not familiar enough with the k8s-side API to say for sure that it would work.

I'm not familiar with how stateful sets interact with cloud provider provisioning of disks. If you have some pointers about that, I'm happy to take a look.

Regarding the use of hostNetwork, the primary reason was that the KVS layer requires IP-level access from outside the cluster. The reduced latency was an added benefit. The containers that need to be on the same node as each other are the function executors and the caches, which are all part of the same DaemonSet. Beyond that, we don't have any colocation requirements.
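
For illustration, a stripped-down sketch of that pattern -- the names, labels, and images are placeholders rather than the actual specs in hydro/cluster/yaml:

```yaml
# Sketch of a DaemonSet that colocates an executor and a cache on each node
# and exposes them on the node's own IP via hostNetwork.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: function-nodes           # placeholder name
spec:
  selector:
    matchLabels:
      role: function             # placeholder label
  template:
    metadata:
      labels:
        role: function
    spec:
      hostNetwork: true          # pods use the node's IP, reachable from outside the cluster
      containers:
        - name: executor         # placeholder container/image
          image: hydroproject/cloudburst
        - name: cache            # placeholder container/image
          image: hydroproject/anna-cache
```

Because both containers live in the same pod of the DaemonSet, the executor and the cache are always scheduled onto the same node.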

@winmillwill
Author

Sorry for the delay in replying.

The way a StatefulSet gets disks is that it has volumeClaimTemplates that produce a PersistentVolumeClaim for each pod, which in turn results in a PersistentVolume being created. The cloud provider hooks in here much like it does when you create a LoadBalancer Service. You can tune this by creating StorageClass resources that specify how the disks are created, and those can be referenced from a PVC (and from a claim template in a StatefulSet). This arrangement lets the operator of the workload just specify the number of replicas: they don't have to add taints to nodes or tolerations to the workload, and they don't need to care which nodes the pods get scheduled on beyond the normal anti-affinity rules for making sure that, e.g., pods land in different AZs. The docs on the disk machinery are here: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
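
A minimal sketch of that arrangement; the names, sizes, and labels are illustrative, and the provisioner shown is GKE's CSI driver (EBS would use ebs.csi.aws.com instead):

```yaml
# Illustrative StorageClass: tells the cloud provider how to create disks on demand.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                      # illustrative name
provisioner: pd.csi.storage.gke.io    # GKE's CSI driver
parameters:
  type: pd-ssd
---
# Illustrative StatefulSet: each replica gets its own dynamically provisioned volume.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: anna-ebs                      # placeholder name
spec:
  serviceName: anna-ebs               # headless service governing the set
  replicas: 3
  selector:
    matchLabels:
      role: ebs                       # placeholder label
  template:
    metadata:
      labels:
        role: ebs
    spec:
      containers:
        - name: server
          image: hydroproject/anna    # placeholder image
          volumeMounts:
            - name: data
              mountPath: /data        # illustrative mount path
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd
        resources:
          requests:
            storage: 64Gi             # illustrative size
```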

For a ConfigMap, the API you consume is essentially a YAML object where the keys are file names and the values are file contents; in each container in the pod spec you can choose a directory to mount that ConfigMap, and its files will appear in that directory. In the application you can watch for filesystem events or periodically re-read the interesting file paths to determine whether the config needs to be reloaded, or you can put it on the operator to ensure the pods get cycled after a change to the ConfigMap.
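
Concretely, something like the following -- the ConfigMap name, file name, contents, and mount path are all illustrative, not the actual Hydro config:

```yaml
# Illustrative ConfigMap: each key becomes a file in the mounted directory.
apiVersion: v1
kind: ConfigMap
metadata:
  name: hydro-conf                # hypothetical name
data:
  conf.yml: |                     # hypothetical file name and contents
    threads: 4
    replication: 2
---
# Pod spec fragment consuming it.
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: server
      image: hydroproject/anna    # placeholder image
      volumeMounts:
        - name: conf
          mountPath: /hydro/conf  # illustrative directory; conf.yml appears here
  volumes:
    - name: conf
      configMap:
        name: hydro-conf
```

Edits to the ConfigMap eventually propagate to the mounted files (unless the mount uses subPath), so the process can watch the file, or the operator can roll the pods after a change.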

IIUC, the issue with IP-level access is that, much like with Cassandra, we need to do client-side load balancing. At my day job, we create a LoadBalancer Service for each Cassandra pod that needs out-of-cluster access, and a headless Service for the StatefulSet that runs the Cassandra pods, so that in-cluster clients can just use the headless Service while out-of-cluster clients use the DNS records or IP addresses of the LoadBalancers. Different organizations can handle this differently, e.g. by putting their Kubernetes cluster in the same VPC as the out-of-cluster clients or by peering the two VPCs.
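
Applied to the KVS pods here, that per-pod LoadBalancer pattern looks roughly like this; the statefulset.kubernetes.io/pod-name label is stamped automatically on every StatefulSet pod, while the service and pod names are illustrative:

```yaml
# One LoadBalancer per pod: the selector pins the Service to a single
# StatefulSet pod via the label Kubernetes adds automatically.
apiVersion: v1
kind: Service
metadata:
  name: anna-memory-0-external        # illustrative name
spec:
  type: LoadBalancer
  selector:
    statefulset.kubernetes.io/pod-name: anna-memory-0   # targets pod 0 only
  ports:
    - port: 6200                      # illustrative port
      targetPort: 6200
```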

From what we've discussed I feel like I can work up a POC for some things I'm thinking about. Thanks for the help!
