feat!: Release Kubernetes Autoscaling and EKS 1.29 Upgrade (#301)
* feat!: Enable autoscaling and install required ancillary tooling to support it (#257)

* feat!: add autoscaling to terraform modules and change variable lookup order for instance sizing

* feat!: add cluster autoscaler to cluster

BREAKING CHANGE: A number of variable defaults are removed and variables renamed for node counts.

* Init EKS 1.29 upgrade (#296)

* fix examples

* fix: update output var in examples

* fix: use the 17.x version of the variables

* fix autoscaling

* fix

* fix

* fix circular dep

* fix cluster_name

* remove dependency

* revert revert

* test

* fix

* trying to correct the output

* testing cluster_name

* fix: override the name_prefix for node_groups so they still have the az letter

---------

Co-authored-by: Daniel Panzella <[email protected]>
3 people authored Oct 14, 2024
1 parent 47b06e1 commit e9a5f02
Showing 33 changed files with 406 additions and 459 deletions.
70 changes: 65 additions & 5 deletions README.md
@@ -86,6 +86,22 @@ module "wandb" {

- Run `terraform init` and `terraform apply`

## Cluster Sizing

By default, the Kubernetes instance type, number of instances, Redis cluster size, and database instance size are
standardized via the configurations in [./deployment-size.tf](deployment-size.tf) and selected via the `size` input
variable.

Available sizes are `small`, `medium`, `large`, `xlarge`, and `xxlarge`. The default is `small`.

All the values set via `deployment-size.tf` can be overridden by setting the appropriate input variables.

- `kubernetes_instance_types` - The instance type for the EKS nodes
- `kubernetes_min_nodes_per_az` - The minimum number of nodes in each AZ for the EKS cluster
- `kubernetes_max_nodes_per_az` - The maximum number of nodes in each AZ for the EKS cluster
- `elasticache_node_type` - The instance type for the redis cluster
- `database_instance_class` - The instance type for the database
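
As a rough illustration of the override behaviour described above, the fragment below selects the `medium` preset and overrides only the per-AZ node bounds. The registry source `wandb/wandb/aws` and the numeric values are assumptions for illustration, and the module's other required inputs are omitted, so treat this as a fragment rather than a complete configuration.

```hcl
module "wandb" {
  source = "wandb/wandb/aws" # assumed registry source; adjust to how you consume the module

  # ... other required inputs (omitted here) ...

  # Pick a preset from deployment-size.tf.
  size = "medium"

  # Override only the per-AZ node bounds (illustrative values). Instance types,
  # cache, and database sizes still come from the "medium" preset.
  kubernetes_min_nodes_per_az = 2
  kubernetes_max_nodes_per_az = 4
}
```

Any input left unset should keep the value of the chosen preset.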

## Examples

We have included documentation and reference examples for additional common
@@ -124,7 +140,7 @@ Upgrades must be executed in step-wise fashion from one version to the next. You

| Name | Version |
|------|---------|
| <a name="provider_aws"></a> [aws](#provider\_aws) | ~> 4.0 |
| <a name="provider_aws"></a> [aws](#provider\_aws) | 4.67.0 |

## Modules

@@ -164,12 +180,14 @@ Upgrades must be executed in step-wise fashion from one version to the next. You
| <a name="input_bucket_kms_key_arn"></a> [bucket\_kms\_key\_arn](#input\_bucket\_kms\_key\_arn) | n/a | `string` | `""` | no |
| <a name="input_bucket_name"></a> [bucket\_name](#input\_bucket\_name) | n/a | `string` | `""` | no |
| <a name="input_bucket_path"></a> [bucket\_path](#input\_bucket\_path) | path of where to store data for the instance-level bucket | `string` | `""` | no |
| <a name="input_clickhouse_endpoint_service_id"></a> [clickhouse\_endpoint\_service\_id](#input\_clickhouse\_endpoint\_service\_id) | The service ID of the VPC endpoint service for Clickhouse | `string` | `""` | no |
| <a name="input_controller_image_tag"></a> [controller\_image\_tag](#input\_controller\_image\_tag) | Tag of the controller image to deploy | `string` | `"1.14.0"` | no |
| <a name="input_create_bucket"></a> [create\_bucket](#input\_create\_bucket) | ######################################### External Bucket # ######################################### Most users will not need these settings. They are ment for users who want a bucket and sqs that are in a different account. | `bool` | `true` | no |
| <a name="input_create_elasticache"></a> [create\_elasticache](#input\_create\_elasticache) | Boolean indicating whether to provision an elasticache instance (true) or not (false). | `bool` | `true` | no |
| <a name="input_create_vpc"></a> [create\_vpc](#input\_create\_vpc) | Boolean indicating whether to deploy a VPC (true) or not (false). | `bool` | `true` | no |
| <a name="input_custom_domain_filter"></a> [custom\_domain\_filter](#input\_custom\_domain\_filter) | A custom domain filter to be used by external-dns instead of the default FQDN. If not set, the local FQDN is used. | `string` | `null` | no |
| <a name="input_database_binlog_format"></a> [database\_binlog\_format](#input\_database\_binlog\_format) | Specifies the binlog\_format value to set for the database | `string` | `"ROW"` | no |
| <a name="input_database_engine_version"></a> [database\_engine\_version](#input\_database\_engine\_version) | Version for MySQL Auora | `string` | `"8.0.mysql_aurora.3.05.2"` | no |
| <a name="input_database_engine_version"></a> [database\_engine\_version](#input\_database\_engine\_version) | Version for MySQL Aurora | `string` | `"8.0.mysql_aurora.3.07.1"` | no |
| <a name="input_database_innodb_lru_scan_depth"></a> [database\_innodb\_lru\_scan\_depth](#input\_database\_innodb\_lru\_scan\_depth) | Specifies the innodb\_lru\_scan\_depth value to set for the database | `number` | `128` | no |
| <a name="input_database_instance_class"></a> [database\_instance\_class](#input\_database\_instance\_class) | Instance type to use by database master instance. | `string` | `"db.r5.large"` | no |
| <a name="input_database_kms_key_arn"></a> [database\_kms\_key\_arn](#input\_database\_kms\_key\_arn) | n/a | `string` | `""` | no |
@@ -183,14 +201,16 @@ Upgrades must be executed in step-wise fashion from one version to the next. You
| <a name="input_eks_cluster_version"></a> [eks\_cluster\_version](#input\_eks\_cluster\_version) | EKS cluster kubernetes version | `string` | n/a | yes |
| <a name="input_eks_policy_arns"></a> [eks\_policy\_arns](#input\_eks\_policy\_arns) | Additional IAM policy to apply to the EKS cluster | `list(string)` | `[]` | no |
| <a name="input_elasticache_node_type"></a> [elasticache\_node\_type](#input\_elasticache\_node\_type) | The type of the redis cache node to deploy | `string` | `"cache.t2.medium"` | no |
| <a name="input_enable_dummy_dns"></a> [enable\_dummy\_dns](#input\_enable\_dummy\_dns) | Boolean indicating whether or not to enable dummy DNS for the old alb | `bool` | `false` | no |
| <a name="input_enable_operator_alb"></a> [enable\_operator\_alb](#input\_enable\_operator\_alb) | Boolean indicating whether to use operatore ALB (true) or not (false). | `bool` | `false` | no |
| <a name="input_enable_clickhouse"></a> [enable\_clickhouse](#input\_enable\_clickhouse) | Provision clickhouse resources | `bool` | `false` | no |
| <a name="input_enable_yace"></a> [enable\_yace](#input\_enable\_yace) | deploy yet another cloudwatch exporter to fetch aws resources metrics | `bool` | `true` | no |
| <a name="input_external_dns"></a> [external\_dns](#input\_external\_dns) | Using external DNS. A `subdomain` must also be specified if this value is true. | `bool` | `false` | no |
| <a name="input_extra_fqdn"></a> [extra\_fqdn](#input\_extra\_fqdn) | Additional fqdn's must be in the same hosted zone as `domain_name`. | `list(string)` | `[]` | no |
| <a name="input_kms_clickhouse_key_alias"></a> [kms\_clickhouse\_key\_alias](#input\_kms\_clickhouse\_key\_alias) | KMS key alias for AWS KMS Customer managed key used by Clickhouse CMEK. | `string` | `null` | no |
| <a name="input_kms_clickhouse_key_policy"></a> [kms\_clickhouse\_key\_policy](#input\_kms\_clickhouse\_key\_policy) | The policy that will define the permissions for the clickhouse kms key. | `string` | `""` | no |
| <a name="input_kms_key_alias"></a> [kms\_key\_alias](#input\_kms\_key\_alias) | KMS key alias for AWS KMS Customer managed key. | `string` | `null` | no |
| <a name="input_kms_key_deletion_window"></a> [kms\_key\_deletion\_window](#input\_kms\_key\_deletion\_window) | Duration in days to destroy the key after it is deleted. Must be between 7 and 30 days. | `number` | `7` | no |
| <a name="input_kms_key_policy"></a> [kms\_key\_policy](#input\_kms\_key\_policy) | The policy that will define the permissions for the kms key. | `string` | `""` | no |
| <a name="input_kms_key_policy_administrator_arn"></a> [kms\_key\_policy\_administrator\_arn](#input\_kms\_key\_policy\_administrator\_arn) | The principal that will be allowed to manage the kms key. | `string` | `""` | no |
| <a name="input_kubernetes_alb_internet_facing"></a> [kubernetes\_alb\_internet\_facing](#input\_kubernetes\_alb\_internet\_facing) | Indicates whether or not the ALB controlled by the Amazon ALB ingress controller is internet-facing or internal. | `bool` | `true` | no |
| <a name="input_kubernetes_alb_subnets"></a> [kubernetes\_alb\_subnets](#input\_kubernetes\_alb\_subnets) | List of subnet ID's the ALB will use for ingress traffic. | `list(string)` | `[]` | no |
| <a name="input_kubernetes_instance_types"></a> [kubernetes\_instance\_types](#input\_kubernetes\_instance\_types) | EC2 Instance type for primary node group. | `list(string)` | <pre>[<br> "m5.large"<br>]</pre> | no |
@@ -212,6 +232,7 @@ Upgrades must be executed in step-wise fashion from one version to the next. You
| <a name="input_network_private_subnets"></a> [network\_private\_subnets](#input\_network\_private\_subnets) | A list of the identities of the private subnetworks in which resources will be deployed. | `list(string)` | `[]` | no |
| <a name="input_network_public_subnet_cidrs"></a> [network\_public\_subnet\_cidrs](#input\_network\_public\_subnet\_cidrs) | List of private subnet CIDR ranges to create in VPC. | `list(string)` | <pre>[<br> "10.10.0.0/24",<br> "10.10.1.0/24"<br>]</pre> | no |
| <a name="input_network_public_subnets"></a> [network\_public\_subnets](#input\_network\_public\_subnets) | A list of the identities of the public subnetworks in which resources will be deployed. | `list(string)` | `[]` | no |
| <a name="input_operator_chart_version"></a> [operator\_chart\_version](#input\_operator\_chart\_version) | Version of the operator chart to deploy | `string` | `"1.3.4"` | no |
| <a name="input_other_wandb_env"></a> [other\_wandb\_env](#input\_other\_wandb\_env) | Extra environment variables for W&B | `map(any)` | `{}` | no |
| <a name="input_parquet_wandb_env"></a> [parquet\_wandb\_env](#input\_parquet\_wandb\_env) | Extra environment variables for W&B | `map(string)` | `{}` | no |
| <a name="input_private_link_allowed_account_ids"></a> [private\_link\_allowed\_account\_ids](#input\_private\_link\_allowed\_account\_ids) | List of AWS account IDs allowed to access the VPC Endpoint Service | `list(string)` | `[]` | no |
@@ -246,7 +267,7 @@ Upgrades must be executed in step-wise fashion from one version to the next. You
| <a name="output_eks_node_count"></a> [eks\_node\_count](#output\_eks\_node\_count) | n/a |
| <a name="output_eks_node_instance_type"></a> [eks\_node\_instance\_type](#output\_eks\_node\_instance\_type) | n/a |
| <a name="output_elasticache_connection_string"></a> [elasticache\_connection\_string](#output\_elasticache\_connection\_string) | n/a |
| <a name="output_internal_app_port"></a> [internal\_app\_port](#output\_internal\_app\_port) | n/a |
| <a name="output_kms_clickhouse_key_arn"></a> [kms\_clickhouse\_key\_arn](#output\_kms\_clickhouse\_key\_arn) | The Amazon Resource Name of the KMS key used to encrypt Weave data at rest in Clickhouse. |
| <a name="output_kms_key_arn"></a> [kms\_key\_arn](#output\_kms\_key\_arn) | The Amazon Resource Name of the KMS key used to encrypt data at rest. |
| <a name="output_network_id"></a> [network\_id](#output\_network\_id) | The identity of the VPC in which resources are deployed. |
| <a name="output_network_private_subnets"></a> [network\_private\_subnets](#output\_network\_private\_subnets) | The identities of the private subnetworks deployed within the VPC. |
@@ -263,6 +284,45 @@ Upgrades must be executed in step-wise fashion from one version to the next. You

See our upgrade guide [here](./docs/operator-migration/readme.md)

### Upgrading from 4.x -> 5.x

5.0.0 introduced autoscaling to the EKS cluster and made the `size` variable the preferred way to set the cluster size.
Previously, unless the `size` variable was set explicitly, there were default values for the following variables:
- `kubernetes_instance_types`
- `kubernetes_node_count`
- `elasticache_node_type`
- `database_instance_class`

The `size` variable now defaults to `small`, and the following variables can be used to partially override the values
set by the `size` variable:
- `kubernetes_instance_types`
- `kubernetes_min_nodes_per_az`
- `kubernetes_max_nodes_per_az`
- `elasticache_node_type`
- `database_instance_class`

For more information on the available sizes, see the [Cluster Sizing](#cluster-sizing) section.

If having the cluster scale nodes in and out is not desired, the `kubernetes_min_nodes_per_az` and
`kubernetes_max_nodes_per_az` can be set to the same value to prevent the cluster from scaling.
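
A minimal sketch of the fixed-size case, assuming a 4.x configuration that previously set `kubernetes_node_count = 2`. The registry source and the numbers are illustrative, and whether the old count maps one-to-one onto a per-AZ value is an assumption you should check against your own cluster layout.

```hcl
module "wandb" {
  source = "wandb/wandb/aws" # assumed registry source

  # ... other required inputs (omitted here) ...

  # 4.x style, replaced by the per-AZ bounds in 5.x:
  # kubernetes_node_count = 2

  # 5.x style: equal min and max leaves the autoscaler no room to add or remove nodes.
  kubernetes_min_nodes_per_az = 2
  kubernetes_max_nodes_per_az = 2
}
```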

This upgrade is also intended to be used when upgrading EKS to 1.29.

We have upgraded the following dependencies and Kubernetes addons:

- MySQL Aurora (8.0.mysql_aurora.3.07.1)
- redis (7.1)
- external-dns helm chart (v1.15.0)
- aws-efs-csi-driver (v2.0.7-eksbuild.1)
- aws-ebs-csi-driver (v1.35.0-eksbuild.1)
- coredns (v1.11.3-eksbuild.1)
- kube-proxy (v1.29.7-eksbuild.9)
- vpc-cni (v1.18.3-eksbuild.3)

> :warning: Please remove the `enable_dummy_dns` and `enable_operator_alb` variables
> as they are no longer valid flags. They were provided to support older versions of
> the module that relied on an alb not created by the ingress controller.
### Upgrading from 3.x -> 4.x

- If egress access for retrieving the wandb/controller image is not available, Terraform apply may experience failures.
45 changes: 25 additions & 20 deletions deployment-size.tf
@@ -6,34 +6,39 @@
 locals {
   deployment_size = {
     small = {
-      db            = "db.r6g.large",
-      node_count    = 2,
-      node_instance = "r6i.xlarge"
-      cache         = "cache.m6g.large"
+      db               = "db.r6g.large",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 2,
+      node_instance    = "r6i.xlarge"
+      cache            = "cache.m6g.large"
     },
     medium = {
-      db            = "db.r6g.xlarge",
-      node_count    = 2,
-      node_instance = "r6i.xlarge"
-      cache         = "cache.m6g.large"
+      db               = "db.r6g.xlarge",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 2,
+      node_instance    = "r6i.xlarge"
+      cache            = "cache.m6g.large"
     },
     large = {
-      db            = "db.r6g.2xlarge",
-      node_count    = 2,
-      node_instance = "r6i.2xlarge"
-      cache         = "cache.m6g.xlarge"
+      db               = "db.r6g.2xlarge",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 2,
+      node_instance    = "r6i.2xlarge"
+      cache            = "cache.m6g.xlarge"
     },
     xlarge = {
-      db            = "db.r6g.4xlarge",
-      node_count    = 3,
-      node_instance = "r6i.2xlarge"
-      cache         = "cache.m6g.xlarge"
+      db               = "db.r6g.4xlarge",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 2,
+      node_instance    = "r6i.2xlarge"
+      cache            = "cache.m6g.xlarge"
     },
     xxlarge = {
-      db            = "db.r6g.8xlarge",
-      node_count    = 3,
-      node_instance = "r6i.4xlarge"
-      cache         = "cache.m6g.2xlarge"
+      db               = "db.r6g.8xlarge",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 3,
+      node_instance    = "r6i.4xlarge"
+      cache            = "cache.m6g.2xlarge"
     }
   }
 }
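
The commit message mentions changing the variable lookup order for instance sizing. One way such an "explicit input first, preset second" lookup could be written is sketched below. This is a self-contained, hypothetical configuration: the variable names mirror the README, but the `null` defaults, the `preset` local, and the `effective_sizing` output are assumptions for illustration and may not match the module's actual implementation.

```hcl
# Hypothetical sketch; the module's real implementation may differ.

variable "size" {
  type    = string
  default = "small"
}

variable "kubernetes_instance_types" {
  type    = list(string)
  default = null # null means "use the preset"
}

variable "kubernetes_min_nodes_per_az" {
  type    = number
  default = null
}

variable "kubernetes_max_nodes_per_az" {
  type    = number
  default = null
}

locals {
  # Trimmed copy of the preset table (two sizes, node fields only).
  deployment_size = {
    small   = { min_nodes_per_az = 1, max_nodes_per_az = 2, node_instance = "r6i.xlarge" }
    xxlarge = { min_nodes_per_az = 1, max_nodes_per_az = 3, node_instance = "r6i.4xlarge" }
  }

  preset = local.deployment_size[var.size]

  # Explicit input wins; the preset is only the fallback.
  instance_types   = var.kubernetes_instance_types != null ? var.kubernetes_instance_types : [local.preset.node_instance]
  min_nodes_per_az = coalesce(var.kubernetes_min_nodes_per_az, local.preset.min_nodes_per_az)
  max_nodes_per_az = coalesce(var.kubernetes_max_nodes_per_az, local.preset.max_nodes_per_az)
}

output "effective_sizing" {
  value = {
    instance_types   = local.instance_types
    min_nodes_per_az = local.min_nodes_per_az
    max_nodes_per_az = local.max_nodes_per_az
  }
}
```

With no overrides and the default `size`, the output should resolve to the `small` preset values. By contrast, an expression of the form `try(<preset lookup>, <variable>)` only falls back to the variable when the preset lookup fails, so the preset takes precedence whenever `size` is valid.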
19 changes: 8 additions & 11 deletions examples/byo-vpc-eks-sql-redis/main.tf
@@ -118,20 +118,14 @@ locals {
module "app_lb" {
source = "../../modules/app_lb"

namespace = var.namespace
load_balancing_scheme = var.public_access ? "PUBLIC" : "PRIVATE"
acm_certificate_arn = local.acm_certificate_arn
zone_id = var.zone_id

fqdn = local.full_fqdn
extra_fqdn = local.extra_fqdn
namespace = var.namespace
allowed_inbound_cidr = var.allowed_inbound_cidr
allowed_inbound_ipv6_cidr = var.allowed_inbound_ipv6_cidr
target_port = local.internal_app_port

network_id = local.network_id
network_private_subnets = local.network_private_subnets
network_public_subnets = local.network_public_subnets
private_endpoint_cidr = var.allowed_private_endpoint_cidr
enable_private_only_traffic = var.enable_private_only_traffic

network_id = local.network_id
}

module "private_link" {
@@ -145,6 +139,9 @@ module "private_link" {
alb_name = local.lb_name_truncated
vpc_id = local.network_id

enable_private_only_traffic = var.enable_private_only_traffic
nlb_security_group = module.app_lb.nlb_security_group

depends_on = [
module.wandb
]
13 changes: 13 additions & 0 deletions examples/byo-vpc-eks-sql-redis/variables.tf
@@ -237,6 +237,19 @@ variable "network_private_subnets" {
type = list(string)
}

variable "allowed_private_endpoint_cidr" {
description = "Private CIDRs allowed to access wandb-server."
nullable = false
type = list(string)
default = []
}

variable "enable_private_only_traffic" {
description = "Enable private only traffic from customer private network"
type = bool
default = false
}

variable "network_public_subnets" {
description = "A list of the identities of the public subnetworks in which resources will be deployed."
type = list(string)
4 changes: 2 additions & 2 deletions examples/byo-vpc-eks/main.tf
@@ -75,11 +75,11 @@ module "wandb_infra" {
 }

 data "aws_eks_cluster" "app_cluster" {
-  name = module.wandb_infra.cluster_id
+  name = module.wandb_infra.cluster_name
 }

 data "aws_eks_cluster_auth" "app_cluster" {
-  name = module.wandb_infra.cluster_id
+  name = module.wandb_infra.cluster_name
 }

 provider "kubernetes" {
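
For context, data sources like the two in this hunk are commonly used to configure the `kubernetes` provider. The sketch below shows that common pattern using the renamed `cluster_name` output. It is not taken from this commit (the hunk is cut off at the provider block), so the attribute wiring is an assumption based on typical AWS and Kubernetes provider usage.

```hcl
# Assumes module.wandb_infra exposes a `cluster_name` output, as the hunk suggests.
data "aws_eks_cluster" "app_cluster" {
  name = module.wandb_infra.cluster_name
}

data "aws_eks_cluster_auth" "app_cluster" {
  name = module.wandb_infra.cluster_name
}

provider "kubernetes" {
  # API endpoint and CA bundle come from the cluster data source.
  host                   = data.aws_eks_cluster.app_cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.app_cluster.certificate_authority[0].data)

  # Short-lived token from the auth data source.
  token = data.aws_eks_cluster_auth.app_cluster.token
}
```

Both data sources look the cluster up by name, so they need an output that carries the cluster name.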
2 changes: 0 additions & 2 deletions examples/byo-vpc-eks/variables.tf
@@ -70,14 +70,12 @@ variable "bucket_kms_key_arn" {
default = ""
}


variable "allowed_inbound_cidr" {
default = ["0.0.0.0/0"]
nullable = false
type = list(string)
}


variable "allowed_inbound_ipv6_cidr" {
default = ["::/0"]
nullable = false
34 changes: 15 additions & 19 deletions examples/byo-vpc-sql/main.tf
@@ -23,11 +23,11 @@ data "aws_sqs_queue" "file_storage" {
 }

 data "aws_eks_cluster" "app_cluster" {
-  name = module.app_eks.cluster_id
+  name = module.app_eks.cluster_name
 }

 data "aws_eks_cluster_auth" "app_cluster" {
-  name = module.app_eks.cluster_id
+  name = module.app_eks.cluster_name
 }

 provider "kubernetes" {
@@ -161,13 +161,12 @@ module "app_eks" {
   namespace   = var.namespace
   kms_key_arn = local.kms_key_arn

-  instance_types   = try([local.deployment_size[var.size].node_instance], var.kubernetes_instance_types)
-  desired_capacity = try(local.deployment_size[var.size].node_count, var.kubernetes_node_count)
-  map_accounts     = var.kubernetes_map_accounts
-  map_roles        = var.kubernetes_map_roles
-  map_users        = var.kubernetes_map_users
+  instance_types = try([local.deployment_size[var.size].node_instance], var.kubernetes_instance_types)
+  map_accounts   = var.kubernetes_map_accounts
+  map_roles      = var.kubernetes_map_roles
+  map_users      = var.kubernetes_map_users

-  bucket_kms_key_arn   = local.use_external_bucket ? var.bucket_kms_key_arn : local.kms_key_arn
+  bucket_kms_key_arns  = local.use_external_bucket ? var.bucket_kms_key_arn : local.kms_key_arn
   bucket_arn           = data.aws_s3_bucket.file_storage.arn
   bucket_sqs_queue_arn = local.use_internal_queue ? null : data.aws_sqs_queue.file_storage.0.arn

@@ -202,20 +201,14 @@ locals {
module "app_lb" {
source = "../../modules/app_lb"

namespace = var.namespace
load_balancing_scheme = var.public_access ? "PUBLIC" : "PRIVATE"
acm_certificate_arn = local.acm_certificate_arn
zone_id = var.zone_id

fqdn = local.full_fqdn
extra_fqdn = local.extra_fqdn
namespace = var.namespace
allowed_inbound_cidr = var.allowed_inbound_cidr
allowed_inbound_ipv6_cidr = var.allowed_inbound_ipv6_cidr
target_port = local.internal_app_port

network_id = local.network_id
network_private_subnets = local.network_private_subnets
network_public_subnets = local.network_public_subnets
private_endpoint_cidr = var.allowed_private_endpoint_cidr
enable_private_only_traffic = var.enable_private_only_traffic

network_id = local.network_id
}

module "private_link" {
@@ -229,6 +222,9 @@ module "private_link" {
alb_name = local.lb_name_truncated
vpc_id = local.network_id

enable_private_only_traffic = var.enable_private_only_traffic
nlb_security_group = module.app_lb.nlb_security_group

depends_on = [
module.wandb
]