From 8622181f4bc1a706ad0347049a03353b36b899d3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nathan=20Gau=C3=ABr?= Date: Wed, 4 Dec 2024 18:10:35 +0100 Subject: [PATCH 1/4] [CI] Update documents around the GCP runners --- premerge/README.md | 28 ++------ premerge/architecture.md | 103 +++++++++++++++++++++++++++ premerge/cluster-management.md | 99 ++++++++++++++++++++++++++ premerge/docs.md | 84 ++++++++++++++++++++++ premerge/issues.md | 126 +++++++++++++++++++++++++++++++++ premerge/monitoring.md | 21 ++++++ 6 files changed, 438 insertions(+), 23 deletions(-) create mode 100644 premerge/architecture.md create mode 100644 premerge/cluster-management.md create mode 100644 premerge/docs.md create mode 100644 premerge/issues.md create mode 100644 premerge/monitoring.md diff --git a/premerge/README.md b/premerge/README.md index b01652d87..b371ade4f 100644 --- a/premerge/README.md +++ b/premerge/README.md @@ -5,27 +5,9 @@ resources used to run the premerge checks. Currently, only Google employees with access to the GCP project where these checks are hosted are able to apply changes. Pull requests from anyone are still welcome. -## Setup +## Index -- install terraform (https://developer.hashicorp.com/terraform/install?product_intent=terraform) -- get the GCP tokens: `gcloud auth application-default login` -- initialize terraform: `terraform init` - -To apply any changes to the cluster: -- setup the cluster: `terraform apply` -- terraform will list the list of proposed changes. -- enter 'yes' when prompted. - -## Setting the cluster up for the first time - -``` -terraform apply -target google_container_node_pool.llvm_premerge_linux_service -terraform apply -target google_container_node_pool.llvm_premerge_linux -terraform apply -target google_container_node_pool.llvm_premerge_windows -terraform apply -``` - -Setting the cluster up for the first time is more involved as there are certain -resources where terraform is unable to handle explicit dependencies. This means -that we have to set up the GKE cluster before we setup any of the Kubernetes -resources as otherwise the Terraform Kubernetes provider will error out. +- [Architecture overview](architecture.md) +- [Cluster management](cluster-management.md) +- [Monitoring](monitoring.md) +- [Past issues](issues.md) diff --git a/premerge/architecture.md b/premerge/architecture.md new file mode 100644 index 000000000..554cd7e87 --- /dev/null +++ b/premerge/architecture.md @@ -0,0 +1,103 @@ +# LLVM Premerge infra - GCP runners + +This document describes how the GCP based presubmit infra is working, and +explains common maintenance actions. + +--- +NOTE: As of today, only Googlers can administrate the cluster. +--- + +## Overview + +Presubmit tests are using GitHub workflows. Executing GitHub workflows can be +done in two ways: + - using GitHub provided runners. + - using self-hosted runners. + +GitHub provided runners are not very powerful, and have limitations, but they +are **FREE**. +Self hosted runners are self-hosted, meaning they can be large virtual +machines running on GCP, very powerful, but **expensive**. + +To balance cost/performance, we keep both types. + - simple jobs like `clang-format` shall run on GitHub runners. + - building & testing LLVM shall be done on self-hosted runners. + +LLVM has several flavor of self-hosted runners: + - libcxx runners. + - MacOS runners managed by Microsoft. + - GCP windows/linux runners managed by Google. + +This document only focuses on Google's GCP hosted runners. 
+ +Choosing on which runner a workflow runs is done in the workflow definition: + +``` +jobs: + my_job_name: + # Runs on expensive GCP VMs. + runs-on: llvm-premerge-linux-runners +``` + +Our self hosted runners come in two flavors: + - Linux + - Windows + +## GCP runners - Architecture overview + +Our runners are hosted on a GCP Kubernetes cluster, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller). +The cluster has 3 nodes: + - llvm-premerge-linux + - llvm-premerge-linux-service + - llvm-premerge-windows + +**llvm-premerge-linux-service** is a fixed node, only used to host the +services required to manage the premerge infra (controller, listeners, +monitoring). Today, this node has only one `e2-small` machine. + +**llvm-premerge-linux** is a auto-scaling node with large `c2d-highcpu-56` +VMs. This node runs the Linux workflows. + +**llvm-premerge-windows** is a auto-scaling node with large `c2d-highcpu-56` +VMs. Similar to the Linux node, but this time it runs Windows workflows. + +### Service node: llvm-premerge-linux-service + +This node runs all the services managing the presubmit infra. + - Action Runner Controller + - 1 listener for the Linux runners. + - 1 listener for the windows runners. + - Grafana Alloy to gather metrics. + +The Action Runner Controller listens on the LLVM repository job queue. +Individual jobs are then handled by the listeners. + +How a job is run: + - The controller informs GitHub the self-hosted runner is live. + - A PR is uploaded on GitHub + - The listener finds a Linux job to run. + - The listener creates a new runner pod to be scheduled by Kubernetes. + - Kubernetes adds one instance to the Linux node to schedule new pod. + - The runner starts executing on the new node. + - Once finished, the runner dies, meaning the pod dies. + - If the instance is not reused in the next 10 minutes, Kubernetes will scale + down the instance. + +### Worker nodes : llvm-premerge-linux, llvm-premerge-windows + +To make sure each runner pod is scheduled on the correct node (linux or +windows, avoiding the service node), we use labels & taints. +Those taints are configured in the +[ARC runner templates](linux_runners_values.yaml). + +The other constraints we define are the resource requirements. Without +information, Kubernetes is allowed to schedule multiple pods on the instance. +This becomes very important with the container/runner tandem: + - the container HAS to run on the same instance as the runner. + - the runner itself doesn't request many resources. +So if we do not enforce limits, the controller could schedule 2 runners on +the same instance, forcing containers to share resources. +Resource limits are defined in 2 locations: + - [runner configuration](linux_runners_values.yaml) + - [container template](linux_container_pod_template.yaml) + diff --git a/premerge/cluster-management.md b/premerge/cluster-management.md new file mode 100644 index 000000000..0964964fb --- /dev/null +++ b/premerge/cluster-management.md @@ -0,0 +1,99 @@ +# Cluster configuration + +The cluster is managed using Terraform. The main configuration is +[main.tf](main.tf). + +--- +NOTE: As of today, only Googlers can administrate the cluster. +--- + +Terraform is a tool to automate infrastructure deployment. Basic usage is to +change this configuration and to call `terraform apply` make the required +changes. 
+Terraform won't recreate the whole cluster from scratch every time, instead +it tries to only apply the new changes. To do so, **Terraform needs a state**. + +**If you apply changes without this state, you might break the cluster.** + +The current configuration stores its state into a GCP bucket. + + +## Accessing Google Cloud Console + +This web interface is the easiest way to get a quick look at the infra. + +--- +IMPORTANT: cluster state is managed with terraform. Please DO NOT change +shapes/scaling, and other settings using the cloud console. Any change not +done through terraform will be at best overridden by terraform, and in the +worst case cause an inconsistent state. +--- + +The main part you want too look into is `Menu > Kubernetes Engine > Clusters`. + +Currently, we have 3 clusters: + - `llvm-premerge-checks`: the cluster hosting BuildKite Linux runners. + - `windows-cluster`: the cluster hosting BuildKite Windows runners. + - `llvm-premerge-prototype`: the cluster for those GCP hoster runners. + +Yes, it's called `prototype`, but that's the production cluster. + +To add a VM to the cluster, the VM has to come from a `pool`. A `pool` is +a group of nodes withing a cluster that all have the same configuration. + +For example: +A pool can say it contains at most 10 nodes, each using the `c2d-highcpu-32` +configuration (32 cores, 64GB ram). +In addition, a pool can `autoscale` [docs](https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler). + +If you click on `llvm-premerge-prototype`, and go to the `Nodes` tab, you +will see 3 node pools: +- llvm-premerge-linux +- llvm-premerge-linux-service +- llvm-premerge-windows + +Definition for each pool is in [Architecture overview](architecture.md). + +If you click on a pool, example `llvm-premerge-linux`, you will see one +instance group, and maybe several nodes. + +Each created node must be attached to an instance group, which is used to +manage a group of instances. Because we use automated autoscale, and we have +a basic cluster, we have a single instance group per pool. + +Then, we have the nodes. If you are looking at the panel during off hours, +you might see no nodes at all: when no presubmit is running, no VM is on. +If you are looking at the panel at peak time, you should see 4 instances. +(Today, autoscale is capped at 4 instances). + +If you click on a node, you'll see the CPU usage, memory usage, and can access +the logs for each instance. + +As long as you don't click on actions like `Cordon`, `Edit`, `Delete`, etc, +navigating the GCP panel should not cause any harm. So feel free to look +around to familiarize yourself with the interface. + +## Setup + +- install terraform (https://developer.hashicorp.com/terraform/install?product_intent=terraform) +- get the GCP tokens: `gcloud auth application-default login` +- initialize terraform: `terraform init` + +To apply any changes to the cluster: +- setup the cluster: `terraform apply` +- terraform will list the list of proposed changes. +- enter 'yes' when prompted. + +## Setting the cluster up for the first time + +``` +terraform apply -target google_container_node_pool.llvm_premerge_linux_service +terraform apply -target google_container_node_pool.llvm_premerge_linux +terraform apply -target google_container_node_pool.llvm_premerge_windows +terraform apply +``` + +Setting the cluster up for the first time is more involved as there are certain +resources where terraform is unable to handle explicit dependencies. 
This means +that we have to set up the GKE cluster before we setup any of the Kubernetes +resources as otherwise the Terraform Kubernetes provider will error out. diff --git a/premerge/docs.md b/premerge/docs.md new file mode 100644 index 000000000..3e2d18326 --- /dev/null +++ b/premerge/docs.md @@ -0,0 +1,84 @@ +# LLVM Premerge infra - GCP runners + +This document describes how the GCP based presubmit infra is working, and +explains common maintenance actions. + +--- +NOTE: As of today, only Googlers can administrate the cluster. +--- + +## Overview + +Presubmit tests are using GitHub workflows. Executing GitHub workflows can be +done in two ways: + - using GitHub provided runners. + - using self-hosted runners on GCP. + +GitHub provided runners are not very powerful, and have limitations, but they +are **FREE**. +Self hosted runners are large virtual machines, very powerful, but they are +**expensive**. + +To balance cost/performance, we keep both runners. + - simple jobs like `clang-format` shall run on GitHub runners. + - building & testing LLVM shall be done on self-hosted runners. + +The choice between self-hosted & GitHub runners is done in the workflow +definition: + +``` +jobs: + my_job_name: + # Runs on expensive GCP VMs. + runs-on: llvm-premerge-linux-runners +``` + +Our self hosted runners come in two flavors: + - linux + - windows + +## GCP runners - Architecture overview + +Our runners are hosted on a GCP Kubernetes cluster, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller). +The cluster has 3 nodes: + - llvm-premerge-linux + - llvm-premerge-linux-service + - llvm-premerge-windows + +**llvm-premerge-linux-service** is a fixed node, only used to host the +services required to manage the premerge infra (controller, listeners, +monitoring). Today, this node has only one e2-small machine. + +**llvm-premerge-linux** is a auto-scaling node with large c2d-highcpu-56 VMs. +This node runs the linux workflows. + +**llvm-premerge-windows** is a auto-scaling node with large c2d-highcpu-56 VMs. +Similar to the linux node, but this time it runs Windows workflows. + +### Service node: llvm-premerge-linux-service + +This node runs all the services managing the presubmit infra. + - Action Runner Controller + - 1 listener for the linux runners. + - 1 listener for the windows runners. + - Grafana Alloy to gather metrics. + + +The Action Runner Controller listens on the LLVM repository job queue. +Individual jobs are then handled by the listeners. + +How a job is run: + - The controller informs GitHub the self-hosted runner is live. + - A PR is uploaded on GitHub + - The listener finds a linux job to run. + - The listener creates a new runner pod to be scheduled by Kubernetes. + - Kubernetes adds one instance to the linux node to schedule new pod. + - The runner starts executing on the new node. + - Once finished, the runner dies, meaning the pod dies. + - If the instance is not reused in the next 10 minutes, Kubernetes will scale + down the instance. + +To make sure each pod is scheduled on the correct node (linux or windows, +avoiding the service node), we use labels & tains. +Those tains are configured in the [ARC runner templates](premerge/linux_runners_values.yaml). 
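For illustration, the mechanism looks roughly like the pod spec sketch below.
The label and taint keys are assumptions made for readability; the real
values live in the ARC runner templates linked above.

```
# Illustrative sketch only: the actual keys and values are defined in the
# ARC runner templates, not here.
spec:
  nodeSelector:
    premerge-platform: linux      # assumed label on the linux worker nodes
  tolerations:
    - key: premerge-platform      # assumed taint key set on the worker pools
      operator: Equal
      value: linux
      effect: NoSchedule
```

A pod with this spec can only land on nodes carrying the matching label, and
is allowed onto nodes carrying the matching taint, which keeps runner pods
off the service pool and keeps other pods off the worker pools.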
+ diff --git a/premerge/issues.md b/premerge/issues.md new file mode 100644 index 000000000..3b724d206 --- /dev/null +++ b/premerge/issues.md @@ -0,0 +1,126 @@ +# Past Issues + +This document lists past issues that could be of interest if you encounter +issues with the cluser/presubmit. + + +## Workflows are failing: DNS resolution of github.com fails. + +### Date: 2025-01-27 + +### Symptoms: + +We noticed GitHub jobs were failing, and the logs showed `hostname lookup +failed for github.com`. + +### Investigation + +Initial steps were to check the github status page. No outages. +Then, we checked internal incident page. No outages. + +Then we looked at the instance logs. +When a node fails, GCP removes it, meaning the logs are not accessible from +the node page anymore, but can be retrieved either in the global logs. + +Looking at the logs, we discovered other services failing to resolve +hostname, like the metrics container. + +In Kubernetes, each cluster runs a `kube-dns` service/pod, which is used +by other pods to do DNS requests. +This pod was crashing. + +Looking at the node this pod was running on showed a RAM usage close to the +VM limits. +At the time, the service instances were running on `e2-small` VMs, which only +have 2GB of RAM. +In addition, we recently added more runner nodes by adding a new Windows pool. +This meant cluster size increased. This caused the cluster management services +to take more resources to run, and pushed us just above the 2GB limit. + +This causes the kube-dns service to be OOM killed, and then caused various DNS +failures in the cluster. + +### Solution + +Change the shape of the service pool to be `e2-highcpu-4`, doubling the RAM +and CPU budget. We also increased the pool size from 2 to 3. + +## LLVM dashboard graphs are empty for presubmits + +### Date: 2025-01-28 + +### Symptoms + +The LLVM dashboard was showing empty graphs for the presubmit job runtime and +queue time. Autoscaling graphs were still working. + +### Investigation + +The graphs were empty because no new metrics were received, but other GCP +metrics were still showing. +Our dashboard has multiple data source: + - the GCP cluster. + - the metrics container. + +Because we had GCP metrics, it meant the Grafana instance was working, and +the Grafana Alloy component running in the cluster was also fine. + +It was probably the metrics container. +We checked the heartbeat metric: `metrics_container_heartbeat`. +This is a simple ping recorded every minutes by the container. If this +metrics stops emitting, it means something is wrong with the job. +This metric was still being recorded. + +A recent change was made to add the windows version of the premerge check. +This caused the job name to change, and thus changed the recorded metric +names from `llvm_premerge_checks_linux_run_time` to +`llvm_premerge_checks_premerge_checks_linux_run_time`. + +### Solution + +Change the dashboards to read the new metric name instead of the previous +name, allowing new data to be shown. +SLO definitions and alerts also had to be adjusted to look at the new metrics. + +## LLVM dashboard graphs are empty for run/queue times + +### Date: 2025-01-10 + +### Symptoms + +The LLVM dashboard was showing empty graphs for the presubmit job runtime and +queue time. Autoscaling graphs were still working. + +### Investigation + +Grafana was still recording GCP metrics, but no new data coming from the +metrics container. + +A quick look at the google cloud console showed the metrics container pod was +crashing. 
+Looking at the logs, we saw the script failed to connect to GitHub to get +the workflow status. Reason was a bad GitHub token. + +Because we have no admin access to the GitHub admin organization, we cannot +emit LLVM owned tokens. A Googler had used its personal account to setup a +PAT token. This token expired in December, causing the metrics container +to fail since. + +### Solution + +Another Googler generated a new token, and replaced it in `Menu > Security > Secrets Manager > llvm-premerge-github-pat`. +Note: this secret is in the general secret manager, not in `Kubernetes Engine > Secrets & ConfigMaps`. + +Once the secret updated, the metrics container had to be restarted: +- `Menu > Kubernetes Engine > Workflows` +- select `metrics`. +- click `Rolling update` +- set all thresholds to `100%`, the click update. + +This will allow GCP to delete the only metrics container pod, and recreate it +using the new secret value. +Because we have a single metrics container instance running, we have to but +all thresholds to `100%`. + +In addition, we added a heartbeat metric to the container, and Grafana +alerting to make sure we detect this kind of failure early. diff --git a/premerge/monitoring.md b/premerge/monitoring.md new file mode 100644 index 000000000..7d0bdb301 --- /dev/null +++ b/premerge/monitoring.md @@ -0,0 +1,21 @@ +# Monitoring + +Presubmit monitoring is provided by Grafana. +The dashboard link is [https://llvm.grafana.net/dashboards](https://llvm.grafana.net/dashboards). + +Grafana pulls its data from 2 sources: the GCP Kubernetes cluster & GitHub. +Grafana instance access is restricted, but there is a publicly visible dashboard: +- [Public dashboard](https://llvm.grafana.net/public-dashboards/6a1c1969b6794e0a8ee5d494c72ce2cd) + + +## GCP monitoring + +Cluster metrics are gathered through Grafana alloy. +This service is deployed using Helm, as described [HERE](main.tf) + +## Github monitoring + +Github CI queue and job status is fetched using a custom script which pushes +metrics to grafana. +The script itself lives in the llvm-project repository: [LINK](https://github.com/llvm/llvm-project/blob/main/.ci/metrics/metrics.py). +The deployment configuration if in the [metrics_deployment file](metrics_deployment.yaml). From e77b4b361eff5959291b6943a9dac4b20b2c0e32 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nathan=20Gau=C3=ABr?= Date: Wed, 29 Jan 2025 18:05:15 +0100 Subject: [PATCH 2/4] pr feedback --- premerge/architecture.md | 41 ++++++++++++++-------------------- premerge/cluster-management.md | 2 +- premerge/docs.md | 4 ---- premerge/issues.md | 1 - premerge/monitoring.md | 1 - 5 files changed, 18 insertions(+), 31 deletions(-) diff --git a/premerge/architecture.md b/premerge/architecture.md index 554cd7e87..a5afeccd6 100644 --- a/premerge/architecture.md +++ b/premerge/architecture.md @@ -3,10 +3,6 @@ This document describes how the GCP based presubmit infra is working, and explains common maintenance actions. ---- -NOTE: As of today, only Googlers can administrate the cluster. ---- - ## Overview Presubmit tests are using GitHub workflows. Executing GitHub workflows can be @@ -25,7 +21,7 @@ To balance cost/performance, we keep both types. LLVM has several flavor of self-hosted runners: - libcxx runners. - - MacOS runners managed by Microsoft. + - MacOS runners for HLSL managed by Microsoft. - GCP windows/linux runners managed by Google. This document only focuses on Google's GCP hosted runners. 
@@ -46,24 +42,24 @@ Our self hosted runners come in two flavors: ## GCP runners - Architecture overview Our runners are hosted on a GCP Kubernetes cluster, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller). -The cluster has 3 nodes: +The cluster has 3 pools: - llvm-premerge-linux - llvm-premerge-linux-service - llvm-premerge-windows -**llvm-premerge-linux-service** is a fixed node, only used to host the +**llvm-premerge-linux-service** is a fixed pool, only used to host the services required to manage the premerge infra (controller, listeners, -monitoring). Today, this node has only one `e2-small` machine. +monitoring). Today, this pool has three `e2-highcpu-4` machine. -**llvm-premerge-linux** is a auto-scaling node with large `c2d-highcpu-56` -VMs. This node runs the Linux workflows. +**llvm-premerge-linux** is a auto-scaling pool with large `n2-standard-64` +VMs. This pool runs the Linux workflows. -**llvm-premerge-windows** is a auto-scaling node with large `c2d-highcpu-56` -VMs. Similar to the Linux node, but this time it runs Windows workflows. +**llvm-premerge-windows** is a auto-scaling pool with large `n2-standard-64` +VMs. Similar to the Linux pool, but this time it runs Windows workflows. -### Service node: llvm-premerge-linux-service +### Service pool: llvm-premerge-linux-service -This node runs all the services managing the presubmit infra. +This pool runs all the services managing the presubmit infra. - Action Runner Controller - 1 listener for the Linux runners. - 1 listener for the windows runners. @@ -73,28 +69,25 @@ The Action Runner Controller listens on the LLVM repository job queue. Individual jobs are then handled by the listeners. How a job is run: - - The controller informs GitHub the self-hosted runner is live. + - The controller informs GitHub the self-hosted runner set is live. - A PR is uploaded on GitHub - The listener finds a Linux job to run. - The listener creates a new runner pod to be scheduled by Kubernetes. - - Kubernetes adds one instance to the Linux node to schedule new pod. + - Kubernetes adds one instance to the Linux pool to schedule new pod. - The runner starts executing on the new node. - Once finished, the runner dies, meaning the pod dies. - - If the instance is not reused in the next 10 minutes, Kubernetes will scale - down the instance. + - If the instance is not reused in the next 10 minutes, the autoscaler + will turn down the instance, freeing resources. -### Worker nodes : llvm-premerge-linux, llvm-premerge-windows +### Worker pools : llvm-premerge-linux, llvm-premerge-windows -To make sure each runner pod is scheduled on the correct node (linux or -windows, avoiding the service node), we use labels & taints. +To make sure each runner pod is scheduled on the correct pool (linux or +windows, avoiding the service pool), we use labels & taints. Those taints are configured in the [ARC runner templates](linux_runners_values.yaml). The other constraints we define are the resource requirements. Without information, Kubernetes is allowed to schedule multiple pods on the instance. -This becomes very important with the container/runner tandem: - - the container HAS to run on the same instance as the runner. - - the runner itself doesn't request many resources. So if we do not enforce limits, the controller could schedule 2 runners on the same instance, forcing containers to share resources. 
Resource limits are defined in 2 locations: diff --git a/premerge/cluster-management.md b/premerge/cluster-management.md index 0964964fb..2213e807c 100644 --- a/premerge/cluster-management.md +++ b/premerge/cluster-management.md @@ -52,7 +52,7 @@ will see 3 node pools: - llvm-premerge-linux-service - llvm-premerge-windows -Definition for each pool is in [Architecture overview](architecture.md). +Definitions for each pool is in [Architecture overview](architecture.md). If you click on a pool, example `llvm-premerge-linux`, you will see one instance group, and maybe several nodes. diff --git a/premerge/docs.md b/premerge/docs.md index 3e2d18326..e58df7f09 100644 --- a/premerge/docs.md +++ b/premerge/docs.md @@ -3,10 +3,6 @@ This document describes how the GCP based presubmit infra is working, and explains common maintenance actions. ---- -NOTE: As of today, only Googlers can administrate the cluster. ---- - ## Overview Presubmit tests are using GitHub workflows. Executing GitHub workflows can be diff --git a/premerge/issues.md b/premerge/issues.md index 3b724d206..5c76a72c1 100644 --- a/premerge/issues.md +++ b/premerge/issues.md @@ -3,7 +3,6 @@ This document lists past issues that could be of interest if you encounter issues with the cluser/presubmit. - ## Workflows are failing: DNS resolution of github.com fails. ### Date: 2025-01-27 diff --git a/premerge/monitoring.md b/premerge/monitoring.md index 7d0bdb301..b0617faeb 100644 --- a/premerge/monitoring.md +++ b/premerge/monitoring.md @@ -7,7 +7,6 @@ Grafana pulls its data from 2 sources: the GCP Kubernetes cluster & GitHub. Grafana instance access is restricted, but there is a publicly visible dashboard: - [Public dashboard](https://llvm.grafana.net/public-dashboards/6a1c1969b6794e0a8ee5d494c72ce2cd) - ## GCP monitoring Cluster metrics are gathered through Grafana alloy. From 3e2f16de5ae9c323b2738e2b203e4eeb2f0fb40f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nathan=20Gau=C3=ABr?= Date: Thu, 30 Jan 2025 14:53:13 +0100 Subject: [PATCH 3/4] removing old file --- premerge/docs.md | 80 ------------------------------------------------ 1 file changed, 80 deletions(-) delete mode 100644 premerge/docs.md diff --git a/premerge/docs.md b/premerge/docs.md deleted file mode 100644 index e58df7f09..000000000 --- a/premerge/docs.md +++ /dev/null @@ -1,80 +0,0 @@ -# LLVM Premerge infra - GCP runners - -This document describes how the GCP based presubmit infra is working, and -explains common maintenance actions. - -## Overview - -Presubmit tests are using GitHub workflows. Executing GitHub workflows can be -done in two ways: - - using GitHub provided runners. - - using self-hosted runners on GCP. - -GitHub provided runners are not very powerful, and have limitations, but they -are **FREE**. -Self hosted runners are large virtual machines, very powerful, but they are -**expensive**. - -To balance cost/performance, we keep both runners. - - simple jobs like `clang-format` shall run on GitHub runners. - - building & testing LLVM shall be done on self-hosted runners. - -The choice between self-hosted & GitHub runners is done in the workflow -definition: - -``` -jobs: - my_job_name: - # Runs on expensive GCP VMs. 
- runs-on: llvm-premerge-linux-runners -``` - -Our self hosted runners come in two flavors: - - linux - - windows - -## GCP runners - Architecture overview - -Our runners are hosted on a GCP Kubernetes cluster, and use the [Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller). -The cluster has 3 nodes: - - llvm-premerge-linux - - llvm-premerge-linux-service - - llvm-premerge-windows - -**llvm-premerge-linux-service** is a fixed node, only used to host the -services required to manage the premerge infra (controller, listeners, -monitoring). Today, this node has only one e2-small machine. - -**llvm-premerge-linux** is a auto-scaling node with large c2d-highcpu-56 VMs. -This node runs the linux workflows. - -**llvm-premerge-windows** is a auto-scaling node with large c2d-highcpu-56 VMs. -Similar to the linux node, but this time it runs Windows workflows. - -### Service node: llvm-premerge-linux-service - -This node runs all the services managing the presubmit infra. - - Action Runner Controller - - 1 listener for the linux runners. - - 1 listener for the windows runners. - - Grafana Alloy to gather metrics. - - -The Action Runner Controller listens on the LLVM repository job queue. -Individual jobs are then handled by the listeners. - -How a job is run: - - The controller informs GitHub the self-hosted runner is live. - - A PR is uploaded on GitHub - - The listener finds a linux job to run. - - The listener creates a new runner pod to be scheduled by Kubernetes. - - Kubernetes adds one instance to the linux node to schedule new pod. - - The runner starts executing on the new node. - - Once finished, the runner dies, meaning the pod dies. - - If the instance is not reused in the next 10 minutes, Kubernetes will scale - down the instance. - -To make sure each pod is scheduled on the correct node (linux or windows, -avoiding the service node), we use labels & tains. -Those tains are configured in the [ARC runner templates](premerge/linux_runners_values.yaml). - From 244b6f5b7de1cad5bb523cd83b606de003b3f792 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Nathan=20Gau=C3=ABr?= Date: Thu, 30 Jan 2025 15:02:08 +0100 Subject: [PATCH 4/4] pr feedback --- premerge/architecture.md | 12 ++++++------ premerge/cluster-management.md | 1 + premerge/monitoring.md | 2 +- 3 files changed, 8 insertions(+), 7 deletions(-) diff --git a/premerge/architecture.md b/premerge/architecture.md index a5afeccd6..a2ff11b5c 100644 --- a/premerge/architecture.md +++ b/premerge/architecture.md @@ -64,6 +64,7 @@ This pool runs all the services managing the presubmit infra. - 1 listener for the Linux runners. - 1 listener for the windows runners. - Grafana Alloy to gather metrics. + - metrics container. The Action Runner Controller listens on the LLVM repository job queue. Individual jobs are then handled by the listeners. @@ -82,15 +83,14 @@ How a job is run: ### Worker pools : llvm-premerge-linux, llvm-premerge-windows To make sure each runner pod is scheduled on the correct pool (linux or -windows, avoiding the service pool), we use labels & taints. -Those taints are configured in the -[ARC runner templates](linux_runners_values.yaml). +windows, avoiding the service pool), we use labels and taints. The other constraints we define are the resource requirements. Without information, Kubernetes is allowed to schedule multiple pods on the instance. 
So if we do not enforce limits, the controller could schedule 2 runners on the same instance, forcing containers to share resources. -Resource limits are defined in 2 locations: - - [runner configuration](linux_runners_values.yaml) - - [container template](linux_container_pod_template.yaml) + +Those bits are configures in the +[linux runner configuration](linux_runners_values.yaml) and +[windows runner configuration](windows_runners_values.yaml). diff --git a/premerge/cluster-management.md b/premerge/cluster-management.md index 2213e807c..dd157e82c 100644 --- a/premerge/cluster-management.md +++ b/premerge/cluster-management.md @@ -37,6 +37,7 @@ Currently, we have 3 clusters: - `llvm-premerge-prototype`: the cluster for those GCP hoster runners. Yes, it's called `prototype`, but that's the production cluster. +We should rename it at some point. To add a VM to the cluster, the VM has to come from a `pool`. A `pool` is a group of nodes withing a cluster that all have the same configuration. diff --git a/premerge/monitoring.md b/premerge/monitoring.md index b0617faeb..81627abfe 100644 --- a/premerge/monitoring.md +++ b/premerge/monitoring.md @@ -5,7 +5,7 @@ The dashboard link is [https://llvm.grafana.net/dashboards](https://llvm.grafana Grafana pulls its data from 2 sources: the GCP Kubernetes cluster & GitHub. Grafana instance access is restricted, but there is a publicly visible dashboard: -- [Public dashboard](https://llvm.grafana.net/public-dashboards/6a1c1969b6794e0a8ee5d494c72ce2cd) +- [Public dashboard](https://llvm.grafana.net/public-dashboards/21c6e0a7cdd14651a90e118df46be4cc) ## GCP monitoring
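As a footnote to the resource-constraint change in architecture.md above: the
reservation itself is expressed with ordinary Kubernetes requests and limits
on the runner container. The sketch below is illustrative only; the numbers
are placeholders, and the real values live in the runner configuration files
referenced there.

```
# Illustrative sketch only: placeholder numbers, not the values used by the
# actual runner configuration.
spec:
  containers:
    - name: runner
      resources:
        requests:
          cpu: "60"          # placeholder: sized to roughly fill one worker VM
          memory: "240Gi"    # placeholder
        limits:
          cpu: "60"
          memory: "240Gi"
```

Setting the requests close to a whole machine is what prevents Kubernetes
from packing two runners onto the same instance.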