KEP-4224: Replace kube-up e2e clusters with kops clusters #4250
base: master
Conversation
upodroid commented on Sep 28, 2023:
- One-line PR description: Replace kube-up e2e clusters with kops clusters
- Issue link: Replace kube-up e2e clusters with kops for kubernetes/kubernetes e2e testing #4224
information to express the idea and why it was not acceptable.
-->

We don't have one.
Was using Cluster API considered? I remember there was an issue open in k/k (can't seem to find the link) about deprecating the cluster/ directory and using cluster-api for tests.
kOps doesn't support as many infrastructure providers (everything besides AWS and GCE is alpha or beta), so it may be less versatile for running k/k tests across different clouds in the future.
I didn't articulate this, but CAPI requires a kubernetes cluster to create the test cluster, which adds an extra complication. Also, is it trivial to fix cluster bootstrap business logic? For example, this is a list of bugs I fixed or am fixing in kops: https://github.com/kubernetes/kops/pulls?q=is%3Apr+author%3Aupodroid
Also, CAPG isn't well maintained. This is what I found when I took CAPG for a spin at the end of 2022: vmware-tanzu/crash-diagnostics#243
Relevant historical Slack thread:
https://kubernetes.slack.com/archives/C2C40FMNF/p1657180208598469
Google docs link to previous discussion:
Looks like this never got an owner, lemme know if I can help with that.
> I didn't articulate this, but CAPI requires a kubernetes cluster to create the test cluster, which adds an extra complication
A common way to get around this is to use a kind cluster as bootstrap cluster. All the k/k cloud-provider Azure tests create test k8s clusters using CAPZ currently: https://testgrid.k8s.io/provider-azure-master-signal
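For illustration, a minimal sketch of that kind-as-bootstrap flow (the provider choice and the manifest step are illustrative assumptions; each infrastructure provider needs its own credentials and flags):

```sh
# Throwaway management cluster on the CI host.
kind create cluster --name bootstrap
# Install Cluster API plus an infrastructure provider (azure shown as an example).
clusterctl init --infrastructure azure
# ... apply Cluster/MachineDeployment manifests to create the test cluster in the cloud,
# run the e2e suite against it, then delete it ...
# The bootstrap cluster is discarded afterwards.
kind delete cluster --name bootstrap
```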
> Also, CAPG isn't well maintained
@dims @richardcase @cpanato are the maintainers of CAPG, they can comment on the project maturity.
An alternative would be to run most tests using Docker, which doesn't require spinning up any cloud infrastructure (less $$ and faster); this is how core Cluster API runs all its tests: https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/docker/README.md
There is one more aspect to using kOps for testing purposes, which is less complexity. kOps creates the cloud resources, starts the K8s cluster, and then gets out of the way. There are no controllers that try to reconcile things in the background. This makes it easy to figure out what is happening with failing tests or broken clusters.
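For reference, that whole lifecycle is a handful of imperative commands with no controllers left running afterwards (a minimal sketch; the state-store bucket, cluster name, and zone are illustrative):

```sh
export KOPS_STATE_STORE=s3://example-kops-state-store   # hypothetical state bucket
# Create the cloud resources and start the cluster; --yes applies immediately.
kops create cluster --name e2e-test.k8s.local --zones us-east-1a --yes
# Block until the cluster is healthy; after this, kops is out of the picture.
kops validate cluster --name e2e-test.k8s.local --wait 15m
# ... run the e2e suite; nothing reconciles in the background ...
kops delete cluster --name e2e-test.k8s.local --yes
```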
@upodroid please consider recording the points above in the "alternatives" section of this KEP. "We don't have one" doesn't seem factual given the discussion above.
I'll add a summary of this thread to the alternatives section and join the CAPI meeting this afternoon to say hello and answer some questions.
Summarizing the feedback at the Cluster API office hours:
- It'd be great to repurpose the KEP to be focused on improving the current e2e test suites, and not rely on a particular tool to run them (e.g. kube-up or kops).
- Cluster API and the major providers like AWS, GCP, Azure, and vSphere are all currently running conformance test suites. @fabriziopandini will coordinate and open issues with the different providers to start running the other e2e suites, like the ones shown by @upodroid in https://testgrid.k8s.io/sig-cluster-lifecycle-kubeup-to-kops
- The Cluster API bootstrap cluster problem can potentially be solved in a number of different ways, which the CAPI community and I can help with, with the end goal of reducing costs.
@upodroid @CecileRobertMichon @fabriziopandini did I miss anything from the summary?
Related note for CAPG maintainers (@dims @cpanato @richardcase): it'd be good to understand the delta of features we'd need to run the e2e suites on CAPG. It seems the provider has made lots of progress in recent months, and I can carve out some time to help as well.
keps/sig-testing/4224-replace-kubeup-clusters-with-kops/README.md
- The shell scripts are very fragile and no new features are being accepted
- arm64 testing needs to be done on GCE
- Python and Debian upgrades always break the Kubernetes e2e tests, particularly at a bad time (cutting releases, reverting/patching a critical bug). This will no longer be the case once we move to kops. For example,
- We have some tests in kubernetes that make assumptions about specific pieces of cloud infrastructure and are not a good fit for kubernetes e2e tests. Tests that rely on cloud infra that is not reachable will be removed.
> Tests that rely on cloud infra that is not reachable will be removed.
Can you please expand on this one?
I think he means things like kubernetes/kubernetes#120968
@upodroid I'd like us to run these jobs on both GCE and EC2 in parallel as well
@dims @BenTheElder what would it take for us to run these tests on Azure as well? cc @lachie83
Thanks @upodroid
information to express the idea and why it was not acceptable.
-->

We don't have one.
I think this thread is missing the point.
This KEP is about eliminating kube-up and the kube-up jobs by shifting to existing production-grade coverage.
Other CI coverage will continue to exist alongside this, and folks are welcome to continue to invest in that.
kOps has already provided reliable CI coverage for years for the vendors that actually provide us credits.
Future cloud vendors are not the problem: we're many, many years into the credits program and moving CI out of google.com, and yet GCP+AWS are the only vendors that actually provide credits. Of those two, with apologies, CAPG support is just not there.
> Bootstrap is not an issue, you can use kind or even lighter options like https://github.com/fabriziopandini/kBB-8 (8s to start provisioning a cluster)
As a kind maintainer ... that's actually unacceptably expensive. KIND is only cheaper when we don't have to run the cloud cluster. If we have to run a kind cluster for every cloud e2e cluster, the CI costs are going to get out of hand.
e2e tests can run for a long time and generally require almost no resources in the CI cluster, only the cluster under test.
> An alternative would be to run most tests using Docker, which doesn't require spinning up any cloud infrastructure (less $$ and faster); this is how core Cluster API runs all its tests: https://github.com/kubernetes-sigs/cluster-api/blob/main/test/infrastructure/docker/README.md
Again, actually much more expensive than kOps. The kind clusters are not free either, just cheaper than a cloud cluster. But one cloud cluster is cheaper than two cloud clusters or a cloud cluster + a kind cluster.
kops has an Azure implementation, but it is alpha and I haven't used it. For EC2, I would duplicate these jobs https://testgrid.k8s.io/sig-cluster-lifecycle-kubeup-to-kops to run on EC2 and Ubuntu. I'm expecting fewer failing tests, as the AWS cloud provider is better maintained and the Kubernetes test suite runs fewer tests against the AWS clusters.
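For reference, a hypothetical sketch of how such a duplicated job might invoke kubetest2's kops deployer (the specific deployer and tester flags below are assumptions, not the actual job config):

```sh
# Hypothetical invocation; the deployer/tester flags are illustrative assumptions.
# --cloud-provider targets EC2 instead of GCE; --up/--down manage the cluster lifecycle.
kubetest2 kops \
  --cloud-provider=aws \
  --up --down \
  --test=ginkgo \
  -- --focus-regex='\[Conformance\]'
```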
Azure cannot be used for release-blocking because those accounts are a mystery with no budget and are not owned by the project. We've been bitten badly by this in the past, with PR-blocking kops-aws running out of money and shutting down before we had community-controlled accounts. EDIT: Note that this sub-topic is a very old discussion we've been having for years now, so we shouldn't block phasing out kube-up on revisiting this for the Nth time, though I'd be happy to see this change anyhow :-)
@CecileRobertMichon we'll need to set up a credits program where infra is run by volunteers, similar to what we do with GCP and AWS.
Some context on the origin of this KEP to try to focus the discussion:

There is a group of developers committed to supporting arm CI kubernetes/test-infra#29693, and there was a PR to add arm support to the existing cluster scripts kubernetes/kubernetes#120144. I asked them to look for some existing tool that runs kubernetes on arm, and it turned out kops already had a CI job for running on arm kubernetes/kubernetes#78995 (comment).

On another note, there was another effort to migrate scalability jobs to AWS, and those developers chose kops too kubernetes/test-infra#29139.

Since the migration of jobs to kops was getting traction and showing results, I asked @upodroid to open a KEP for visibility and for defining the criteria and the plan to make this effort sustainable in the long term and how we can do a smooth transition.

One important thing I want to highlight is that the CI of kubernetes is for testing kubernetes, not for testing the tool that tests kubernetes at the same time. If these new jobs start to flake or are unstable because of errors or incompatibilities in the installer, and these errors are not promptly fixed or are opened against kubernetes developers, we'll revisit this decision.
## Scalability

kops is already used to run scale tests on AWS. We can use it to replace the kube-up scale tests.
The test on AWS is still visibly less stable than the GCE one:
https://testgrid.k8s.io/kops-misc#ec2-master-scale-performance
vs
https://testgrid.k8s.io/sig-scalability-gce#gce-master-scale-performance
It's not at all clear to me whether that's because some settings that are critical to achieving reasonable performance at scale are not set correctly in kops.
@shyamjvs - FYI
Note - I'm supportive of the effort itself; I'm just saying that it may not be as straightforward as you think...
In general, what I would like to see is a diff of the flags [fortunately all our components log them on startup] between the existing and new jobs, and to show that this diff is zero :)
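For illustration, a rough sketch of producing that diff from two live clusters (the kubeconfig context names and the static-pod label are assumptions; kube-up and kops may label the apiserver pod differently):

```sh
# Dump the kube-apiserver flags from each cluster and diff them; the goal is an empty diff.
for ctx in kube-up-cluster kops-cluster; do   # hypothetical kubeconfig context names
  kubectl --context "$ctx" -n kube-system get pods -l component=kube-apiserver \
    -o jsonpath='{.items[0].spec.containers[0].command}' \
    | tr ',' '\n' | grep -- '--' | sort > "flags-$ctx.txt"
done
diff flags-kube-up-cluster.txt flags-kops-cluster.txt
```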
kops does expose most of the settings but they do have values that are different from what is set in kube-up clusters.
kubernetes/kops#15982 tries to close the apiserver differences
I'll look at the scale tests once the serial, disruptive and alpha tests are stabilised.
We use visibly more settings for scalability tests - you may want to look at:
https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presets.yaml
This is an example of a blocker.
@CecileRobertMichon kOps has support for Azure; not perfect, but quite good as of recently.
What's the future of Kubemark in such a setting? AFAIK it's only deployable through the kube-up scripts. /cc @marseel
@jprzychodzen FWIW, we do have an actively maintained kubemark provider for cluster-api, https://github.com/kubernetes-sigs/cluster-api-provider-kubemark . That is another possibility for creating kubemark nodes, since cluster api has come up in discussion here.
Here's an old write-up from when we were discussing options, for the record: https://hackmd.io/pw1kt61lRM-wZh5MU1G_QQ
/label tide/merge-method-squash
proposal will be implemented, this is the place to discuss them.
-->

We'll create new prowjobs that use kops clusters to run cluster e2e testing. kops has been used for e2e testing for a long time, but it runs a narrower set of tests designed for testing Kubernetes distributions with various components.
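For illustration, a minimal sketch of what one such prowjob could look like (the job name, image tag, and kubetest2 arguments are illustrative assumptions, not the final config):

```yaml
periodics:
- name: ci-kubernetes-e2e-kops-gce-example   # hypothetical job name
  interval: 4h
  decorate: true
  spec:
    containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master   # illustrative tag
      command:
      - runner.sh
      args:
      - kubetest2
      - kops
      - --up
      - --down
      - --test=ginkgo   # exact deployer/tester flags are assumptions
```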
I'd like to see the difference between the components installed by kube-up and by kops. What CNI is being used, for example? kube-up is pretty neat; I don't want to end up with calico or cilium in the critical path, for example.
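For reference, the CNI is an explicit choice in a kops cluster spec; a minimal excerpt (the cluster name and networking value here are illustrative, not necessarily what the migration jobs would use):

```yaml
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: example.k8s.local   # hypothetical cluster name
spec:
  networking:
    kubenet: {}   # closest to kube-up's default; kops also supports cilium, calico, etc.
```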
We should migrate it to this; however, it's worth pointing out that kube-up.sh is already long disowned, deprecated, and removed as a subproject by SIG Cluster Lifecycle, and is ad-hoc maintained by a handful of ~SIG Testing folks because so much CI uses it. We need to phase out the bulk of CI using it, and eventually the rest. So I would say kubemark jobs are already at risk by not migrating off of it.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/lifecycle rotten
/close
@k8s-triage-robot: Closed this PR.
/reopen
@ameukam: Reopened this PR.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: upodroid
Needs approval from an approver in each of these files.