pre/post_bootstrap_user_data doesn't work anymore with AL2023 #3186

Open
1 task done
rgarrigue opened this issue Oct 22, 2024 · 9 comments

@rgarrigue

rgarrigue commented Oct 22, 2024

Description

I switched my EKS managed node groups to ami_type AL2023_x86_64_STANDARD (previously AL2_x86_64). My user_data then stopped working, and I can see this "Unhandled unknown content-type" warning in journalctl -u cloud-init.service:

Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: +-------+-------------+---------+-----------+-------+
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: | Route | Destination | Gateway | Interface | Flags |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: +-------+-------------+---------+-----------+-------+
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: |   0   |  fe80::/64  |    ::   |  enp39s0  |   U   |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: |   2   |    local    |    ::   |  enp39s0  |   U   |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: |   3   |  multicast  |    ::   |  enp39s0  |   U   |
Oct 21 11:29:03 localhost cloud-init[2783]: ci-info: +-------+-------------+---------+-----------+-------+
Oct 21 11:29:03 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: 2024-10-21 11:29:03,539 - __init__.py[WARNING]: Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...'
Oct 21 11:29:04 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: Generating public/private ed25519 key pair.
Oct 21 11:29:04 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: Your identification has been saved in /etc/ssh/ssh_host_ed25519_key
Oct 21 11:29:04 ip-10-20-10-69.eu-north-1.compute.internal cloud-init[2783]: Your public key has been saved in /etc/ssh/ssh_host_ed25519_key.pub

Compared with AL2 worker nodes, the part-001 etc. script files are absent, i.e. the scripts/ folder is empty:

/var/lib/cloud/instances/i-0faebac7b8b11778c/scripts
/var/lib/cloud/instances/i-0faebac7b8b11778c/scripts/part-001
/var/lib/cloud/instances/i-0faebac7b8b11778c/scripts/part-002
  • ✋ I have searched the open/closed issues and my issue is not listed.

Versions

  • Module version [Required]: 20.24.2

  • Terraform version:

    ```
    Terraform v1.6.6
    on linux_amd64
    + provider registry.terraform.io/hashicorp/aws v5.72.1
    + provider registry.terraform.io/hashicorp/cloudinit v2.3.5
    + provider registry.terraform.io/hashicorp/kubernetes v2.21.1
    + provider registry.terraform.io/hashicorp/null v3.2.3
    + provider registry.terraform.io/hashicorp/time v0.12.1
    + provider registry.terraform.io/hashicorp/tls v4.0.6
    ```
  • Provider version(s): Execute: terraform providers -version : same output as above (issue template to be updated ?)

Reproduction Code [Required]

Steps to reproduce the behavior:

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "20.26.0"

  cluster_name = "test"
  cluster_version = "1.31"

  # Network
  vpc_id     = "vpc-0052643b5ded2cce4"
  subnet_ids = ["subnet-0304ee0b265a7d4a3","subnet-0ee42ef7b5d2d5a71"]

  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true

  # Addons
  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent    = true
      before_compute = true
    }
  }

  eks_managed_node_group_defaults = {
    ami_type       = "AL2023_x86_64_STANDARD"
    instance_types = ["c5.large"]
    launch_template_name = "test"

    attach_cluster_primary_security_group = true

    iam_role_additional_policies = {
      "ssm" : "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    }

    post_bootstrap_user_data = <<-EOT
      echo
      echo "Add ops' shared public key to '$(whoami)' user SSH's authorized_keys"
      echo
      groupadd ops
      useradd -s /bin/bash -g ops ops
      mkdir -p /home/ops/.ssh
      chmod 0700 /home/ops/.ssh
      echo "ssh-ed25519 AAAAC3Nz______________dQpkJ5 ops shared key" | tee /home/ops/.ssh/authorized_keys
      chmod 0444 /home/ops/.ssh/authorized_keys
      chown -R ops: /home/ops
      echo "ops ALL=(ALL) NOPASSWD: ALL" | tee /etc/sudoers.d/ops
      chmod 0400 /etc/sudoers.d/ops
    EOT
  }

  eks_managed_node_groups = {
    default = {
      name         = "test"
      min_size     = 1
      max_size     = 1
      desired_size = 1
      subnet_ids   = ["subnet-0304ee0b265a7d4a3","subnet-0ee42ef7b5d2d5a71"]

      block_device_mappings = {
        xvda = {
          device_name = "/dev/xvda"
          ebs = {
            volume_size           = 100
            volume_type           = "gp3"
            iops                  = 200
            delete_on_termination = true
          }
        }
      }
    }
  }
}

No workspace
Local cache cleared
Steps: change the ami_type value to AL2023_x86_64_STANDARD

Expected behavior

I expect my user data to be executed and the ops user to be created, so that with this ~/.ssh/config

host i-* mi-*
  ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"
  StrictHostKeyChecking no
  User ops
  IdentityFile ops

I can SSH in:

❯ ssh i-08042aefcc8bb7624
Updates Information Summary: available
    1 Security notice(s)
        1 Medium Security notice(s)

   ,     #_
   ~\_  ####_        Amazon Linux 2023
  ~~  \_#####\
  ~~     \###|
  ~~       \#/ ___   https://aws.amazon.com/linux/amazon-linux-2023
   ~~       V~' '->
    ~~~         /
      ~~._.   _/
         _/ _/
       _/m/'
Last login: Tue Oct 22 07:42:33 2024 from 127.0.0.1

Actual behavior

❯ ssh i-0485fe90afd97a39e
Warning: Permanently added 'i-0485fe90afd97a39e' (ED25519) to the list of known hosts.
Received disconnect from UNKNOWN port 65535:2: Too many authentication failures
Disconnected from UNKNOWN port 65535

I have to open the AWS console, go to the EC2 instance, connect via SSM, sudo, and run my user data by hand. Only then can I SSH in as intended.

Edit

Fixed the TF snippet and tried the latest module version, 20.26.0. No improvement.

@Indigenuity

I can confirm this with module version 20.26, and also from diving into the module code. The userdata for AL2023 completely ignores any values in the pre_bootstrap_user_data and post_bootstrap_user_data variables. I can see that the template file makes no reference to either variable.

Instead, completely new variables with new expected syntax were introduced: cloudinit_pre_nodeadm and cloudinit_post_nodeadm. I don't see these vars or the new behavior documented anywhere.

Is the intent to stop supporting the userdata vars in this module? Or was it an oversight to leave out those variables from the AL2023 template file?

@bryantbiggs
Member

AL2023 uses a different form of user data than AL2:

module "eks_mng_al2023_no_op" {
  source = "../../modules/_user_data"

  ami_type = "AL2023_x86_64_STANDARD"

  # Hard requirement
  cluster_service_cidr = local.cluster_service_cidr
}

module "eks_mng_al2023_additional" {
  source = "../../modules/_user_data"

  ami_type = "AL2023_x86_64_STANDARD"

  # Hard requirement
  cluster_service_cidr = local.cluster_service_cidr

  cloudinit_pre_nodeadm = [{
    content      = <<-EOT
      ---
      apiVersion: node.eks.aws/v1alpha1
      kind: NodeConfig
      spec:
        kubelet:
          config:
            shutdownGracePeriod: 30s
            featureGates:
              DisableKubeletCloudCredentialProviders: true
    EOT
    content_type = "application/node.eks.aws"
  }]
}

module "eks_mng_al2023_custom_ami" {
  source = "../../modules/_user_data"

  ami_type = "AL2023_x86_64_STANDARD"

  cluster_name         = local.name
  cluster_endpoint     = local.cluster_endpoint
  cluster_auth_base64  = local.cluster_auth_base64
  cluster_service_cidr = local.cluster_service_cidr

  enable_bootstrap_user_data = true

  cloudinit_pre_nodeadm = [{
    content      = <<-EOT
      ---
      apiVersion: node.eks.aws/v1alpha1
      kind: NodeConfig
      spec:
        kubelet:
          config:
            shutdownGracePeriod: 30s
            featureGates:
              DisableKubeletCloudCredentialProviders: true
    EOT
    content_type = "application/node.eks.aws"
  }]

  cloudinit_post_nodeadm = [{
    content      = <<-EOT
      echo "All done"
    EOT
    content_type = "text/x-shellscript; charset=\"us-ascii\""
  }]
}

module "eks_mng_al2023_custom_template" {
  source = "../../modules/_user_data"

  ami_type = "AL2023_x86_64_STANDARD"

  cluster_name         = local.name
  cluster_endpoint     = local.cluster_endpoint
  cluster_auth_base64  = local.cluster_auth_base64
  cluster_service_cidr = local.cluster_service_cidr

  enable_bootstrap_user_data = true
  user_data_template_path    = "${path.module}/templates/al2023_custom.tpl"

  cloudinit_pre_nodeadm = [{
    content      = <<-EOT
      ---
      apiVersion: node.eks.aws/v1alpha1
      kind: NodeConfig
      spec:
        kubelet:
          config:
            shutdownGracePeriod: 30s
            featureGates:
              DisableKubeletCloudCredentialProviders: true
    EOT
    content_type = "application/node.eks.aws"
  }]

  cloudinit_post_nodeadm = [{
    content      = <<-EOT
      echo "All done"
    EOT
    content_type = "text/x-shellscript; charset=\"us-ascii\""
  }]
}

@Indigenuity

@bryantbiggs Yes, and Windows also has a different form of user data than AL2, but they use the same module variables to build the templates. Are the concepts all that different between AL2 and AL2023? AL2023 seems to work the same way that AL2 works when specifying an AMI in the launch template. The only difference is an additional section for a NodeConfig in its multipart MIME.

I think this is just a matter of broken docs and expectations, not broken code. The logic for shimming a userdata script into a multipart MIME was already in this module, and it used the same userdata variables employed in other scenarios. So despite the fact that the new variables work well and allow flexibility in building a custom multipart MIME message, it is a bit unexpected to have new variables, especially given that the userdata readme still suggests using the older ones.

I'm happy to make some readme update suggestions, though I'm not sure I quite understand the conditionals in the userdata module, and I've probably misunderstood something in the new AL2023 format anyway. If I've just misunderstood, then sorry. In any case, thanks for the time spent on this.

@rgarrigue
Author

An updated README would suit me fine. My current problem is that I don't know how to get started.

@BeckYeh

BeckYeh commented Nov 26, 2024

I use the settings below:

cloudinit_pre_nodeadm = [
  {
    content_type = "text/x-shellscript; charset=\"us-ascii\""
    content      = <<-EOT
      #!/usr/bin/env bash
      dnf install jq -y
      TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
      INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
      HOSTNAME=`hostname -s`
      REGION=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
      aws ec2 create-tags --resources $INSTANCE_ID --tags Key=Name,Value="k8s-worker-node-$HOSTNAME" --region $REGION
    EOT
  }
]

On AL2023 the shell script needs a #! line, because cloud-init runs the script from Python. The result is recorded in cloud-init.log.

@rgarrigue
Author

Doesn't work for me. I still can't find my script (grepping in /var/lib doesn't yield anything).

...
  eks_managed_node_group_defaults = {
    ami_type       = "AL2023_x86_64_STANDARD"
    instance_types = var.default_workers_instance_types
    # disk_size # this is ignored since using a custom LT
    launch_template_name = module.context.prefix

    iam_role_additional_policies = {
      "ssm" : "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore",
    }

    cloudinit_post_nodeadm = [{
      content_type = "text/x-shellscript; charset=\"us-ascii\""
      content      = <<-EOT
        #!/usr/bin/env bash
        echo
        echo "Add ops' shared public key to '$(whoami)' user SSH's authorized_keys"
        echo
        ...
      EOT
    }]
  }
...

Still the error

2024-12-06 08:27:49,438 - __init__.py[WARNING]: Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...'

I'm wondering whether the warning above is unrelated and I'm simply not setting cloudinit_post_nodeadm in the proper place. As illustrated in the screenshot, I have it in module "eks".eks_managed_node_group_defaults.cloudinit_post_nodeadm. There's no other cloudinit_post_nodeadm in my code, so nothing overrides it elsewhere.

image

@BeckYeh

BeckYeh commented Dec 6, 2024

Please check the launch template's user data first and confirm that the script has been added to the template. You may need to set use_custom_launch_template = true.

@bryantbiggs
Member

2024-12-06 08:27:49,438 - __init__.py[WARNING]: Unhandled unknown content-type (application/node.eks.aws) userdata: 'b'---'...'

This is benign and not a module issue; see awslabs/amazon-eks-ami#1963 (comment).

Also, you can't execute anything after the node bootstrap process when using EKS managed node groups; this is just the way EKS managed node groups work. When you supply user data to EKS MNG, it gets prepended to the user data that EKS MNG supplies, which means you can only run scripts BEFORE the node bootstrap. We model that here by using "pre", as in pre_bootstrap_user_data and cloudinit_pre_nodeadm.

See the docs here https://docs.aws.amazon.com/eks/latest/userguide/launch-templates.html#launch-template-user-data

tl;dr - you need to first understand how EKS works and what changes are happening between each Kubernetes version supported in EKS. Once you understand that, you'll better understand how that is reflected here.
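
A minimal sketch of what this implies for the configuration in the original report: the same ops-user script, moved from post_bootstrap_user_data to cloudinit_pre_nodeadm so that it runs before the node bootstrap. This is an illustrative fragment, not module documentation. It assumes eks_managed_node_group_defaults passes cloudinit_pre_nodeadm through to the node groups, it reuses the redacted key placeholder from the report, and it adds the #! line suggested earlier in the thread.

```hcl
# Fragment of the module "eks" block from the original report (other arguments omitted).
# Assumption: cloudinit_pre_nodeadm is accepted in eks_managed_node_group_defaults.
eks_managed_node_group_defaults = {
  ami_type = "AL2023_x86_64_STANDARD"

  # Shell part that runs BEFORE the node bootstrap; with EKS managed node
  # groups nothing can be scheduled after it.
  cloudinit_pre_nodeadm = [{
    content_type = "text/x-shellscript; charset=\"us-ascii\""
    content      = <<-EOT
      #!/usr/bin/env bash
      echo "Add ops' shared public key to '$(whoami)' user SSH's authorized_keys"
      groupadd ops
      useradd -s /bin/bash -g ops ops
      mkdir -p /home/ops/.ssh
      chmod 0700 /home/ops/.ssh
      # Placeholder key, redacted exactly as in the original report
      echo "ssh-ed25519 AAAAC3Nz______________dQpkJ5 ops shared key" | tee /home/ops/.ssh/authorized_keys
      chmod 0444 /home/ops/.ssh/authorized_keys
      chown -R ops: /home/ops
      echo "ops ALL=(ALL) NOPASSWD: ALL" | tee /etc/sudoers.d/ops
      chmod 0400 /etc/sudoers.d/ops
    EOT
  }]
}
```

Because the script now runs before the node joins the cluster, it should not depend on kubelet or anything else the bootstrap sets up. Creating a local user, as here, does not.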

@rgarrigue
Author

rgarrigue commented Dec 17, 2024

tl;dr - you need to first understand how EKS works and what changes are happening between each Kubernetes version supported in EKS. Once you understand that, you'll better understand how that is reflected here.

I get your point, and I wish I could apply your advice. But the reality is that I'm just driving my EKS car; I'm not an EKS car mechanic, if you get the analogy. The best I can do is swap summer/winter tires. If someday some almighty makes the days 40 hours long instead of 24, then maybe I'll get there. Until then, thanks a lot to the community & you guys for answering my question 🙏🏿

Now, unless you want to keep the issue open as a reminder to add docs or whatever, we can close it.
