Feature: Faster cluster provisioning and autoscaling #1231

bernardhalas · 2024-02-19T08:47:31Z

Motivation

A significant portion of the cluster provisioning and autoscaler execution time is spent downloading various packages and binaries. There is room to optimize this part.

Description

We could speed this up by using our own pre-populated images on providers that allow it. This would make ansibler and kube-eleven run faster, as the binaries would already be present. Where that is not possible (e.g. on providers where we can't deploy our images, or on static nodes), the usual ansibler and kube-eleven flows will take care of the download.
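To illustrate the fallback described above, here is a minimal sketch in Python. It is not Claudie's actual code: the function name `ensure_binaries`, the `download` callback and the binary names in the usage example are hypothetical, and the presence check simply looks the binary up on `PATH`.

```python
import shutil
from typing import Callable, Iterable, List


def _on_path(name: str) -> bool:
    """Default presence check: is the binary resolvable on PATH?"""
    return shutil.which(name) is not None


def ensure_binaries(
    names: Iterable[str],
    download: Callable[[str], None],
    is_present: Callable[[str], bool] = _on_path,
) -> List[str]:
    """Download only the binaries that are missing.

    Returns the names that had to be downloaded, so the caller can
    see how much work a pre-populated image saved.
    """
    downloaded = []
    for name in names:
        if is_present(name):
            continue  # already baked into the image, nothing to do
        download(name)  # usual flow for vanilla images and static nodes
        downloaded.append(name)
    return downloaded


# Usage with stand-ins: pretend "kubelet" is pre-installed, "kubeadm" is not.
fetched = []
result = ensure_binaries(
    ["kubelet", "kubeadm"],
    download=fetched.append,
    is_present=lambda name: name == "kubelet",
)
```

On a pre-populated image the returned list would be empty; on a vanilla image it would contain every name, which matches the current behaviour.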

Note 1: We should assess this approach against custom pre-baked Flatcar, Fedora CoreOS or openSUSE MicroOS images.
Note 2: It would help tremendously in this task if we knew whether we can get rid of WireGuard and use just Cilium for bridging nodes across various networks.

Exit criteria

Figure out the workflow for custom images (naming conventions, validate upgrade/rollback scenarios)
Prepare custom images where applicable
Ensure that we have automation in place for refreshing them

FYI @MiroslavRepka

Danielss89 · 2024-02-20T20:19:27Z

Sounds cool.
Can Claudie check whether a node is using the image or not and figure out what to do?
For example, on OneProvider I can upload images which the servers I create can use. So it might be a dedicated server in Claudie, but still use the image.

fritz-net · 2024-07-03T21:34:13Z

Maybe an alternative for networking could be Kilo -> https://kilo.squat.ai/
As far as I remember, Claudie requires the nodes to be on public IPs anyway.
I used Kilo in a k8s cluster which I set up with kubeadm.

izzm · 2024-12-20T12:25:34Z

Are there any updates on this feature?

bernardhalas · 2024-12-30T13:56:29Z

Hi folks, would you please indicate what is more important for you to speed up at this stage? Is it the cluster provisioning duration or the scale-up/scale-down events?

There are currently two options we're looking into:

Claudie takes a vanilla Ubuntu image and then configures it and installs packages into it. Instead of vanilla Ubuntu, we can use a pre-baked Claudie-Ubuntu flavor (i.e. with all the necessary packages and binaries being part of the image). This would be low-hanging fruit and faster to implement.
Alternatively, we can start considering a custom OS image, which would not be Ubuntu-based but would instead use a container-native Linux OS, like Flatcar, MicroOS or similar. This would be a larger modification, as it would require a rewrite of the KubeEleven service and leave out KubeOne completely.

The answer to the initial question would help us prioritize better.

Danielss89 · 2024-12-30T15:28:37Z

It's definitely the scale-up/scale-down events (autoscaling) for me, as the initial provisioning is not so critical.

bernardhalas · 2025-01-06T21:55:19Z

I've measured the autoscaler speed on GCP. Adding a node takes around 8m30s, measured from the moment the first pod enters the Pending status. By creating a Claudie flavor of the base Ubuntu image, pre-loaded with the necessary packages and binaries (there are around 18 packages installed by ansibler and kube-eleven), we could likely shave off no more than 30 seconds. This is a marginal improvement.

The scale-up would likely be slower on Hetzner, as I don't expect the Hetzner Ubuntu images to come pre-configured for Hetzner-hosted APT repository mirrors the way the GCP ones do, so the impact might be more significant there. But we won't get to a scale-up faster than 8m0s. So if this is not sufficient, we need to look for other alternatives.
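As a rough sanity check on the GCP numbers above, the relative saving can be worked out in a few lines of Python. The helper name `relative_saving` is only illustrative; the inputs are the 8m30s baseline and the 30-second upper bound quoted in this comment.

```python
def relative_saving(baseline_s: float, saved_s: float) -> float:
    """Fraction of the baseline duration removed by the optimization."""
    return saved_s / baseline_s


baseline_s = 8 * 60 + 30  # 8m30s measured on GCP, in seconds
saved_s = 30              # upper bound on what a pre-baked image saves

remaining_s = baseline_s - saved_s           # 480 s, i.e. the 8m0s floor
share = relative_saving(baseline_s, saved_s)

print(f"remaining: {remaining_s // 60}m{remaining_s % 60}s, saving: {share:.1%}")
# prints "remaining: 8m0s, saving: 5.9%"
```

So even in the best case the pre-baked image removes roughly 6% of the scale-up time on GCP.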

bernardhalas added the feature New feature label Feb 19, 2024

bernardhalas mentioned this issue Nov 18, 2024

Feature: Ditch ubuntu for TalOS #1580

Closed
