Feature: Faster cluster provisioning and autoscaling #1231

Open
3 tasks
bernardhalas opened this issue Feb 19, 2024 · 6 comments
Labels
feature New feature

Comments

@bernardhalas
Member

bernardhalas commented Feb 19, 2024

Motivation

A significant portion of the cluster provisioning and autoscaler execution time is spent downloading various packages and binaries. There is room to optimize this part.

Description

We could speed this up by using our own pre-populated images on providers that allow this. This would result in faster ansibler and kube-eleven execution, as the binaries would already be present. In cases where they are not (e.g. on providers where we can't deploy our images, or on static nodes), the usual ansibler and kube-eleven flows will take care of the download.
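A minimal sketch of the fallback idea described above, assuming presence is decided by looking a binary up on `PATH`. The function name `missing_binaries` and the example binary names are hypothetical and are not taken from ansibler or kube-eleven; how Claudie would actually detect a pre-populated image is not specified in this issue.

```python
import shutil
from typing import Callable, Iterable, List


def on_path(name: str) -> bool:
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


def missing_binaries(
    required: Iterable[str],
    is_present: Callable[[str], bool] = on_path,
) -> List[str]:
    """Return the binaries that still need to be downloaded.

    On a pre-populated image this list would be empty, so the download
    step becomes a no-op. On a vanilla image (or a static node) every
    missing binary is returned and the usual download flow handles it.
    """
    return [name for name in required if not is_present(name)]


if __name__ == "__main__":
    # Hypothetical binary names, used for illustration only.
    wanted = ["kubelet", "kubeadm", "wg"]
    print("still to download:", missing_binaries(wanted))
```

The presence check is passed in as a parameter so the same logic can be exercised without touching the real filesystem.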

Note 1: We should assess this approach against custom pre-baked Flatcar, Fedora CoreOS or openSUSE MicroOS images.
Note 2: It would help tremendously in this task if we knew whether we can get rid of WireGuard and use just Cilium for bridging nodes across various networks.

Exit criteria

  • Figure out the workflow for custom images (naming conventions, validate upgrade/rollback scenarios)
  • Prepare custom images where applicable
  • Ensure that we have automation in place for refreshing them

FYI @MiroslavRepka

@bernardhalas bernardhalas added the feature New feature label Feb 19, 2024
@Danielss89
Contributor

Sounds cool.
Can Claudie check whether a node is using the image or not and figure out what to do?
For example, on OneProvider I can upload images which the servers I create can use. So it might be a dedicated server in Claudie, but still use the image.

@fritz-net

Maybe an alternative for networking could be Kilo -> https://kilo.squat.ai/
since, as far as I remember, Claudie requires the nodes to be on public IPs anyway.
I used Kilo in a k8s cluster which I set up with kubeadm.

@izzm

izzm commented Dec 20, 2024

Are there any updates on this feature?

@bernardhalas
Member Author

Hi folks, would you please indicate what's more important for you to speed up at this stage? Is it the cluster provisioning duration or the scale-up/scale-down events?

There are currently two options we're looking into:

  1. Claudie takes a vanilla Ubuntu image, which it then configures and installs packages into. Instead of vanilla Ubuntu, we can use a pre-baked Claudie-Ubuntu flavor (i.e. with all the necessary packages and binaries being a part of the image). This would be low-hanging fruit and faster to implement.
  2. Alternatively, we can start considering a custom OS image that would not be Ubuntu-based, but would instead use a container-native Linux OS, like Flatcar, MicroOS or similar. This would be a larger modification, as it would require a rewrite of the KubeEleven service and leave out KubeOne completely.

The answer to the initial question would help us prioritize better.

@Danielss89
Contributor

It's definitely the scale-up/scale-down event (autoscaling) for me, as the initial provisioning is not so critical.

@bernardhalas
Member Author

bernardhalas commented Jan 6, 2025

I've measured the autoscaler speed on GCP. Adding a node takes around 8m30s, counted from the moment the first pod enters Pending status. By creating a Claudie flavor of the base Ubuntu image pre-loaded with the necessary packages and binaries (there are around 18 packages installed by ansibler and kube-eleven), we could likely shave off no more than 30 seconds. This is a marginal improvement.

The scale-up would likely be slower on Hetzner, as I don't expect the Hetzner Ubuntu images to come pre-configured with Hetzner-hosted APT repository mirrors the way the GCP images are, so the impact might be more significant there. But we won't get to a scale-up faster than 8m0s. So if this is not sufficient, we need to look for other alternatives.
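The arithmetic behind that bound can be sketched as follows. The only inputs are the two GCP figures quoted above (about 8m30s per added node and at most about 30 seconds saved); the variable names are illustrative and no additional measurements are implied.

```python
from datetime import timedelta

# Figures quoted in the comment above (GCP measurement).
baseline = timedelta(minutes=8, seconds=30)  # time to add a node today
max_saving = timedelta(seconds=30)           # upper bound on the saving

# Best-case scale-up time with a pre-baked image.
best_case = baseline - max_saving

# Relative improvement; dividing two timedeltas yields a float.
relative_saving = max_saving / baseline

print(f"best case: {best_case}")
print(f"relative saving: {relative_saving:.1%}")
```

Under these numbers the pre-baked image removes roughly 6% of the scale-up time, which is why the comment calls the improvement marginal.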


4 participants