AWS: Karpenter and SSH for Kubernetes WorkerNodes

By | 06/23/2024
 

We have an AWS EKS cluster with WorkerNodes/EC2 created with Karpenter.

The process of creating the infrastructure, cluster, and launching Karpenter is described in previous posts:

What this system really lacks, is access to servers via SSH, without which you feel like… Well, like DevOps, not Infrastructure Engineer. In short, SSH access is sometimes necessary, but – surprise – Karpenter does not allow you to add a key to the WorkerNodes it manages out of the box.

Although, what’s the problem to add in the EC2NodeClass a way to pass an SSH key, as it is done in the Terraform’s resource "aws_instance" with the parameter key_name?

But it’s okay. If it’s not there, it’s not there. Maybe they will add it later.

Instead, Karpenter’s documentation Can I add SSH keys to a NodePool? suggests using either AWS Systems Manager Session Manager or AWS EC2 Instance Connect, or “the old school way” – add the public part of the key via AWS EC2 User Data, and connect via a bastion host or a VPN.

So what we’re going to do today is:

  • try all three solutions one by one
  • first we’ll do each one by hand, then we’ll see how to add it to our automation with Terraform
  • and then we’ll decide which option will be the easiest

Option 1: AWS Systems Manager Session Manager and SSH on EC2

AWS Systems Manager Session Manager is used to manage EC2 instances. In general, it can do quite a lot, for example, keep track of patches and updates for packages that are installed on instances.

Currently, we are only interested in it as a system that will allow us to have an SSH to a Kubernetes WorkerNode.

It requires an SSM agent, which is installed by default on all instances with Amazon Linux AMI.

Find nodes created by Karpenter (we have a dedicated label for them):

$ kubectl get node -l created-by=karpenter
NAME                          STATUS   ROLES    AGE     VERSION
ip-10-0-34-239.ec2.internal   Ready    <none>   21h     v1.28.8-eks-ae9a62a
ip-10-0-35-100.ec2.internal   Ready    <none>   9m28s   v1.28.8-eks-ae9a62a
ip-10-0-39-0.ec2.internal     Ready    <none>   78m     v1.28.8-eks-ae9a62a
...

Find an Instance ID:

$ kubectl get node ip-10-0-34-239.ec2.internal -o json | jq -r ".spec.providerID" | cut -d \/ -f5 
i-011b1c0b5857b0d92

AWS CLI: TargetNotConnected when calling the StartSession operation

Try to connect and get the error “TargetNotConnected“:

$ aws --profile work ssm start-session --target i-011b1c0b5857b0d92

An error occurred (TargetNotConnected) when calling the StartSession operation: i-011b1c0b5857b0d92 is not connected.

Or through the AWS Console:

But here, too, we have the connection error – “SSM Agent is not online“:

The error occurs because:

  • either the IAM role that is connected to the instance does not have SSM permissions
  • or EC2 is running on a private network and the agent cannot connect to an external endpoint

SessionManager and IAM Policy

Let’s check. Find the IAM Role attached to this EC2:

And the policies connected to it – there is nothing about SSM:

Edit the Role by hand for now, then we will do it in Terraform code:

Attach the AmazonSSMManagedInstanceCore:

And in a minute or two, try again:

SessionManager and VPC Endpoint

Another possible reason for the problems connecting the SSM agent to AWS is that the instance does not have access to SSM endpoints:

  • ssm.region.amazonaws.com
  • ssmmessages.region.amazonaws.com
  • ec2messages.region.amazonaws.com

If the subnet is private, and has limits on external connections, then you may need to create a VPC Endpoint for SSM.

See SSM Agent is not online and Troubleshooting Session Manager.

AWS CLI: SessionManagerPlugin is not found

However, after the IAM fix, when connecting from a workstation using the AWS CLI, we can get the “SessionManagerPlugin is not found” error:

$ aws --profile work ssm start-session --target i-011b1c0b5857b0d92

SessionManagerPlugin is not found. Please refer to SessionManager Documentation here: http://docs.aws.amazon.com/console/systems-manager/session-manager-plugin-not-found

Install it locally – see the documentation Installing the Session Manager Plugin for the AWS CLI.

For Arch Linux there is a aws-session-manager-plugin package in AUR:

$ yay -S aws-session-manager-plugin

And now we can connect:

$ aws --profile work ssm start-session --target i-011b1c0b5857b0d92

Starting session with SessionId: arseny-33ahofrlx7bwlecul2mkvq46gy
sh-4.2$ 

All that’s left is to add it to the automation.

Terraform: EKS module, and adding an IAM Policy

For the Terraform EKS module from Anton Babenko, we can add a policy through the iam_role_additional_policies parameter – see the node_groups.tf, and in the examples of the AWS EKS Terraform module.

In the 20.0 module, the parameter name has changed – iam_role_additional_policies => node_iam_role_additional_policies, but we are still using version 19.21.0, and the role is added in this way:

...
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.21.0"

  cluster_name    = local.env_name
  cluster_version = var.eks_version
  ...
  vpc_id                   = local.vpc_out.vpc_id
  subnet_ids               = data.aws_subnets.private.ids
  control_plane_subnet_ids = data.aws_subnets.intra.ids

  manage_aws_auth_configmap = true

  eks_managed_node_groups = {
      ...
      # allow SSM
      iam_role_additional_policies = {
        AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
      }
...

Remove what we did manually, deploy the Terraform code, and check that the Policy has been added:

And the connection is working:

$ aws --profile work ssm start-session --target i-011b1c0b5857b0d92

Starting session with SessionId: arseny-pt7d44xp6ibvqcezj2oqjaxv5q
sh-4.2$ bash
[ssm-user@ip-10-0-34-239 bin]$ pwd
/usr/bin

Option 2: AWS EC2 Instance Connect and SSH on EC2

Another way to connect is through EC2 Instance Connect. Documentation – Connect to your Linux instance with EC2 Instance Connect.

It also requires an agent, which is also installed by default on the Amazon Linux.

For instances on private networks, EC2 Instance Connect VPC Endpoint is required for connection.

SecurityGroup and SSH

Instance Connect through the endpoint requires access to port 22, SSH (as opposed to SSM, which opens a connection through the agent itself).

Open the port for all addresses in the VPC:

EC2 Instance Connect VPC Endpoint

Go to the VPC Endpoints and create an endpoint:

Select the EC2 Instance Connect Endpoint type, the VPC itself, and the SecurityGroup:

Choose a Subnet – we have most of the resources in us-east-1a, so we’ll use it to avoid unnecessary cross-AvailabilityZone traffic (see AWS: Cost optimization – services expenses overview and traffic costs in AWS):

Wait a few minutes for the Active status:

And connect using AWS CLI by specifying --connection-type eice, because the instances are on a private network:

$ aws --profile work ec2-instance-connect ssh --instance-id i-011b1c0b5857b0d92 --connection-type eice
...
[ec2-user@ip-10-0-34-239 ~]$ 

Terraform: EC2 Instance Connect, EKS, and VPC

For the Terraform, here you will need to add thenode_security_group_additional_rules in the EKS module for SSH access, and create an EC2 Instance Connect Endpoint for the VPC, as in my case we create VPC and EKS separately.

...
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.21.0"

  cluster_name    = local.env_name
  cluster_version = var.eks_version

  ...
      # allow SSM
      iam_role_additional_policies = {
        AmazonSSMManagedInstanceCore = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
      }
      ...
  }

  node_security_group_name    = "${local.env_name}-node-sg"
  cluster_security_group_name = "${local.env_name}-cluster-sg"

  # to use with EC2 Instance Connect
  node_security_group_additional_rules = {
    ingress_ssh_vpc = {
      description = "SSH from VPC"
      protocol    = "tcp"
      from_port   = 22
      to_port     = 22
      cidr_blocks      = [local.vpc_out.vpc_cidr]
      type        = "ingress"
    }
  }

  node_security_group_tags = {
    "karpenter.sh/discovery" = local.env_name
  }
  ...
}
...

If you created it manually, as described above, then remove the rule from SecurityGroup with SSH and deploy it from Terraform.

For the VPC EC2 Instance Connect Endpoint, I did not find how to do this through Anton Babenko’s module terraform-aws-modules/vpc, but you can make it a separate resource through aws_ec2_instance_connect_endpoint:

resource "aws_ec2_instance_connect_endpoint" "example" {
  subnet_id = module.vpc.private_subnets[0]
  security_group_ids = ["sg-0b70cfd6019c635af"]
}

However, here you need to pass the SecurityGroup ID from the cluster, and the cluster is created after the VPC, so there is a chicken-and-egg problem.

In general, the Instance Connect seems to be a little more complicated than SSM in the automation, because there are more changes in the code, and in different modules.

However, it is a working option, and if your automation allows it, you can use it.

Option 3: the old-fashioned way with SSH Public Key via EC2 User Data

And the oldest and perhaps the simplest option is to create an SSH key yourself and add its public part to EC2 when creating an instance.

The disadvantages here are that it will be difficult to add many keys in this way, and EC2 User Data can sometimes go sideways, but if you need to add only one key, a kind of “super-admin” in case of emergency, then this is a perfectly valid option.

Moreover, if you have a VPN to the VPC (see Pritunl: Launching a VPN in AWS on EC2 with Terraform), then the connection will be even easier.

Create a key:

$ ssh-keygen 
Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/setevoy/.ssh/id_ed25519): /home/setevoy/.ssh/atlas-eks-ec2
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /home/setevoy/.ssh/atlas-eks-ec2
Your public key has been saved in /home/setevoy/.ssh/atlas-eks-ec2.pub
...

The public part can be stored in a repository – copy it:

$ cat ~/.ssh/atlas-eks-ec2.pub 
ssh-ed25519 AAA***VMO setevoy@setevoy-wrk-laptop

Next, a few crutches: the EC2NodeClass in my case is created from the Terraform code through the kubectl_manifest resource. The easiest option that has come to mind so far is to add the public key to the variables, and then use it in the kubectl_manifest.

Later, I will probably move such resources to a dedicated Helm chart.

For now, let’s create a new variable:

variable "karpenter_nodeclass_ssh" {
  type        = string
  default = "ssh-ed25519 AAA***VMO setevoy@setevoy-wrk-laptop"
  description = "SSH Public key for EC2 created by Karpenter"
}

In the EC2NodeClass configuration, add the spec.userData:

resource "kubectl_manifest" "karpenter_node_class" {
  yaml_body = <<-YAML
    apiVersion: karpenter.k8s.aws/v1beta1
    kind: EC2NodeClass
    metadata:
      name: default
    spec:
      amiFamily: AL2
      role: ${module.eks.eks_managed_node_groups["${local.env_name_short}-default"].iam_role_name}
      subnetSelectorTerms:
        - tags:
            karpenter.sh/discovery: "atlas-vpc-${var.environment}-private"
      securityGroupSelectorTerms:
        - tags:
            karpenter.sh/discovery: ${local.env_name}
      tags:
        Name: ${local.env_name_short}-karpenter
        environment: ${var.environment}
        created-by: "karpneter"
        karpenter.sh/discovery: ${local.env_name}
      userData: |
        #!/bin/bash
        mkdir -p ~ec2-user/.ssh/
        touch ~ec2-user/.ssh/authorized_keys
        echo "${var.karpenter_nodeclass_ssh}" >> ~ec2-user/.ssh/authorized_keys
        chmod -R go-w ~ec2-user/.ssh/authorized_keys
        chown -R ec2-user ~ec2-user/.ssh       
  YAML

  depends_on = [
    helm_release.karpenter
  ]
}

If you are using not Amazon Linux, then change the ec2-user to the desired one.

Important Note: Keep in mind that changes to EC2NodeClass will recreate all instances, and that your services are configured for stable operation, see Kubernetes: Providing High Availability for Pods.

Deploy it and check it:

$ kk get ec2nodeclass -o yaml
...
    userData: #!/bin/bash\nmkdir -p ~ec2-user/.ssh/\ntouch ~ec2-user/.ssh/authorized_keys\necho
      \"ssh-ed25519 AAA***VMO setevoy@setevoy-wrk-laptop\" >> ~ec2-user/.ssh/authorized_keys\nchmod -R go-w
      ~ec2-user/.ssh/authorized_keys\nchown -R ec2-user ~ec2-user/.ssh       \n
...

Wait for Karpenter to create a new WorkerNode and try SSH:

$ ssh -i ~/.ssh/hOS/atlas-eks-ec2 [email protected]
...
[ec2-user@ip-10-0-39-73 ~]$ 

Done.

Conclusions.

  • AWS SessionManager: looks like the easiest option in terms of automation, recommended by AWS itself, but you need to think about how to use scp through it (although it seems to be possible through additional moves, see .SSH and SCP with AWS SSM)
  • AWS EC2 Instance Connect: a cool feature from Amazon, but somehow more troublesome to automate, so not our option
  • “grandfathered” SSH: well, the old one is tried and true 🙂 but I don’t really like User Data, because sometimes it can lead to problems with launching instances; however, it is also simple in terms of automation, and gives you the usual SSH without additional movements