On all my previous Kubernetes projects I used the Cluster Autoscaler (CAS) for WorkerNodes scaling, simply because there were no other options at the time.
In general, CAS worked well, but in November 2021 AWS released its own solution for scaling nodes in EKS – Karpenter. While the first reviews were mixed, its latest versions are highly praised, so I decided to try it on a new project.
Contents
Karpenter overview, and Karpenter vs Cluster Autoscaler
So what is Karpenter? It is an autoscaler that launches new WorkerNodes when a Kubernetes cluster has Pods that cannot be scheduled due to insufficient resources on the existing WorkerNodes.
Unlike CAS, Karpenter can automatically select the most appropriate instance type depending on the needs of the Pods to be launched.
In addition, it can manage Pods across Nodes to optimize their placement, consolidating workloads so that underutilized WorkerNodes can be stopped to reduce the cost of the cluster.
Another nice feature is that, unlike CAS, you don’t need to create several WorkerNode groups with different instance types – Karpenter can itself determine the type of Node needed for the Pod(s) and create it. No more hassle of choosing between Managed or Self-managed node groups: you just describe which instance types can be used, and Karpenter will create the Node that is needed for each new Pod.
In fact, you completely eliminate the need to interact with AWS for EC2 management – it is all handled by a single component, Karpenter.
Also, Karpenter can handle Terminating and Stopping Events on EC2, and move Pods from Nodes that will be stopped – see native interrupt handling.
Karpenter Best Practices
The complete list is on the Karpenter Best Practices page, and I recommend you look at it. There are also EKS Best Practices Guides – also interesting to read.
Here are the main useful tips:
- the Karpenter controller Pod(s) should run either on Fargate or on a regular Node from an autoscaled NodeGroup (most likely, I will create one such ASG for all critical services with a label like “critical-addons” – for Karpenter, aws-load-balancer-controller, coredns, ebs-csi-controller, external-dns, etc.)
- configure Interruption Handling – then Karpenter will migrate existing Pods from a Node that is going to be removed or terminated by Amazon
- if the Kubernetes API is not available externally (and it should not be), configure an AWS STS VPC endpoint for the VPC of the EKS cluster
- create different Provisioners for different teams that use different types of instances (e.g. for Bottlerocket and Amazon Linux)
- configure consolidation for your Provisioners – then Karpenter will try to move running Pods to existing Nodes, or to a smaller Node that will be cheaper than the current one
- use a Time To Live for Nodes created by Karpenter to remove Nodes that are not in use, see How Karpenter nodes are deprovisioned
- add the karpenter.sh/do-not-evict annotation to Pods that you don’t want to stop – then Karpenter won’t touch a Node on which such Pods are running, even after that Node’s TTL expires
- use Limit Ranges to set default resources limits for Pods
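To illustrate the last two tips, here is a minimal sketch (all names here are illustrative, not from the project): a Pod annotated with karpenter.sh/do-not-evict, and a LimitRange that sets default resources for containers in a namespace:

```yaml
# A Pod that Karpenter will not voluntarily evict during consolidation or TTL expiry
apiVersion: v1
kind: Pod
metadata:
  name: important-job             # illustrative name
  annotations:
    karpenter.sh/do-not-evict: "true"
spec:
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
---
# Default requests/limits applied to containers that do not set their own
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits            # illustrative name
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
```

With defaults in place, Karpenter can compute resource needs even for Pods whose authors forgot to set requests.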
Everything looks quite interesting – let’s try to run it.
Karpenter installation
We will use the Karpenter’s Helm chart.
Later, we will do it properly with automation, but for now, to take a closer look, let’s do it manually.
AWS IAM
KarpenterInstanceNodeRole Role
Go to IAM Roles, and create a new role for WorkerNodes management:
Add Amazon-managed policies:
- AmazonEKSWorkerNodePolicy
- AmazonEKS_CNI_Policy
- AmazonEC2ContainerRegistryReadOnly
- AmazonSSMManagedInstanceCore
Save as KarpenterInstanceNodeRole:
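If you prefer the CLI over the Console, the same role can be sketched with aws iam (a sketch that requires AWS credentials; the trust policy file name is illustrative):

```shell
# Trust policy that lets EC2 instances assume the role
cat > node-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role --role-name KarpenterInstanceNodeRole \
    --assume-role-policy-document file://node-trust-policy.json

# Attach the four Amazon-managed policies
for policy in AmazonEKSWorkerNodePolicy AmazonEKS_CNI_Policy \
              AmazonEC2ContainerRegistryReadOnly AmazonSSMManagedInstanceCore; do
  aws iam attach-role-policy --role-name KarpenterInstanceNodeRole \
      --policy-arn "arn:aws:iam::aws:policy/${policy}"
done

# Karpenter attaches an Instance Profile (not the role directly) to EC2 instances
aws iam create-instance-profile --instance-profile-name KarpenterInstanceNodeRole
aws iam add-role-to-instance-profile \
    --instance-profile-name KarpenterInstanceNodeRole \
    --role-name KarpenterInstanceNodeRole
```

Note that the Console creates the Instance Profile for you automatically; with the CLI it is a separate step.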
KarpenterControllerRole Role
Add another role, for Karpenter itself; here we will describe the policy ourselves in JSON.
Go to IAM > Policies, and create your own policy:
{
  "Statement": [
    {
      "Action": [
        "ssm:GetParameter",
        "iam:PassRole",
        "ec2:DescribeImages",
        "ec2:RunInstances",
        "ec2:DescribeSubnets",
        "ec2:DescribeSecurityGroups",
        "ec2:DescribeLaunchTemplates",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeInstanceTypeOfferings",
        "ec2:DescribeAvailabilityZones",
        "ec2:DeleteLaunchTemplate",
        "ec2:CreateTags",
        "ec2:CreateLaunchTemplate",
        "ec2:CreateFleet",
        "ec2:DescribeSpotPriceHistory",
        "pricing:GetProducts"
      ],
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "Karpenter"
    },
    {
      "Action": "ec2:TerminateInstances",
      "Condition": {
        "StringLike": {
          "ec2:ResourceTag/Name": "*karpenter*"
        }
      },
      "Effect": "Allow",
      "Resource": "*",
      "Sid": "ConditionalEC2Termination"
    }
  ],
  "Version": "2012-10-17"
}
Save as KarpenterControllerPolicy:
Create a second IAM Role with this policy.
You should already have an IAM OIDC identity provider; if not, see the documentation Creating an IAM OIDC provider for your cluster.
When creating the Role, select Web Identity in the Select trusted entity step, and choose the OpenID Connect provider URL of your cluster in Identity provider. In the Audience field, choose sts.amazonaws.com:
Next, attach the policy that was made before:
Save as KarpenterControllerRole.
The Trusted Policy should look like this:
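The original screenshot is not reproduced here, but for an IRSA role the Trust Policy generated by the Console looks approximately like this (the account ID and the OIDC provider ID below are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::492***148:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```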
An IAM Service Account with the KarpenterControllerRole role will be created by the chart itself.
Security Groups and Subnets tags for Karpenter
Next, you need to add a Key=karpenter.sh/discovery,Value=${CLUSTER_NAME} tag to the SecurityGroups and Subnets that are used by the existing WorkerNodes, so Karpenter knows where to create new ones.
In the How do I install Karpenter in my Amazon EKS cluster? there is an example of how to do it with two commands in the terminal, but for the first time, I prefer to do it manually.
Find the SecurityGroups and Subnets of our WorkerNode AutoScaling Group – we have only one for now, so it will be simple:
Add tags:
Repeat for Subnets.
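For reference, the same tagging can be done from the terminal (a sketch that requires AWS credentials; the Security Group and Subnet IDs below are placeholders):

```shell
CLUSTER_NAME="eks-dev-1-26-cluster"

# Tag the WorkerNodes Security Group(s)
aws ec2 create-tags \
    --resources sg-0123456789abcdef0 \
    --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}"

# Tag the Subnets used by the WorkerNodes
aws ec2 create-tags \
    --resources subnet-0123456789abcdef0 subnet-0123456789abcdef1 \
    --tags "Key=karpenter.sh/discovery,Value=${CLUSTER_NAME}"
```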
aws-auth ConfigMap
Add a new role to the aws-auth ConfigMap so that future WorkerNodes can join the cluster.
See Enabling IAM principal access to your cluster.
Let’s back up the ConfigMap:
[simterm]
$ kubectl -n kube-system get configmap aws-auth -o yaml > aws-auth-bkp.yaml
[/simterm]
Edit it:
[simterm]
$ kubectl -n kube-system edit configmap aws-auth
[/simterm]
Add a new mapping to the mapRoles block – map our IAM role for WorkerNodes to the RBAC groups system:bootstrappers and system:nodes. In the rolearn field we set the KarpenterInstanceNodeRole IAM role, which was created for the future WorkerNodes:
...
  - groups:
    - system:bootstrappers
    - system:nodes
    rolearn: arn:aws:iam::492***148:role/KarpenterInstanceNodeRole
    username: system:node:{{EC2PrivateDNSName}}
...
For some reason, my aws-auth ConfigMap was written in one line instead of the usual YAML – maybe because it was created by the AWS CDK; as far as I remember, with eksctl it is created normally:
Let’s rewrite it a bit and add the new mapping.
Be careful here, because you can break the cluster. Do not do such things in Production – there it should all be done with Terraform/CDK/Pulumi/etc automation code:
Check that access has not been broken – let’s look at the Nodes:
[simterm]
$ kk get node
NAME                         STATUS   ROLES    AGE   VERSION
ip-10-0-2-173.ec2.internal   Ready    <none>   28d   v1.26.4-eks-0a21954
ip-10-0-2-220.ec2.internal   Ready    <none>   38d   v1.26.4-eks-0a21954
...
[/simterm]
Works? OK.
Installing the Karpenter Helm chart
In the How do I install Karpenter in my Amazon EKS cluster? guide mentioned above, they suggest using helm template to build the values. A weird solution, if you ask me, but anyway, it works.
We will simply create our own values.yaml – this will be useful for future automation – and set the nodeAffinity and other parameters for the chart there.
The default values of the chart itself are here>>>.
Check the labels of our WorkerNode:
[simterm]
$ kk get node ip-10-0-2-173.ec2.internal -o json | jq -r '.metadata.labels."eks.amazonaws.com/nodegroup"'
EKSClusterNodegroupNodegrou-zUKXsgSLIy6y
[/simterm]
In our values.yaml file, add an affinity – do not change the first part, but in the second, set the eks.amazonaws.com/nodegroup key with the name of the Node Group, EKSClusterNodegroupNodegrou-zUKXsgSLIy6y:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: karpenter.sh/provisioner-name
          operator: DoesNotExist
      - matchExpressions:
        - key: eks.amazonaws.com/nodegroup
          operator: In
          values:
          - EKSClusterNodegroupNodegrou-zUKXsgSLIy6y
In the serviceAccount block, add an annotation with the ARN of our KarpenterControllerRole IAM role:
...
serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::492***148:role/KarpenterControllerRole
Add a settings block. Everything here should be clear from the parameter names.
The only thing to note is that in the defaultInstanceProfile field you specify not the full ARN of the role, but only its name:
...
settings:
  aws:
    clusterName: eks-dev-1-26-cluster
    clusterEndpoint: https://2DC***124.gr7.us-east-1.eks.amazonaws.com
    defaultInstanceProfile: KarpenterInstanceNodeRole
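To enable the native Interruption Handling mentioned in the Best Practices, recent chart versions also accept an SQS queue name in the same settings block. This is a hedged sketch: the queue and its EventBridge rules must already exist (the Getting Started CloudFormation creates them), and the queue name below is illustrative:

```yaml
settings:
  aws:
    # SQS queue that receives EC2 Spot interruption and instance health events
    interruptionQueueName: eks-dev-1-26-karpenter-interruptions
```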
Now we are ready to deploy.
Find the current version of Karpenter on the releases page.
Since we are deploying for a test, you can take the latest one as of today – v0.30.0-rc.0.
Let’s deploy from the Helm OCI registry:
[simterm]
$ helm upgrade --install --namespace dev-karpenter-system-ns --create-namespace \
    -f values.yaml \
    karpenter oci://public.ecr.aws/karpenter/karpenter \
    --version v0.30.0-rc.0 --wait
[/simterm]
Check Pods:
[simterm]
$ kk -n dev-karpenter-system-ns get pod
NAME                         READY   STATUS    RESTARTS   AGE
karpenter-78f4869696-cnlbh   1/1     Running   0          44s
karpenter-78f4869696-vrmrg   1/1     Running   0          44s
[/simterm]
Okay, all good here.
Creating a Default Provisioner
Now we can begin to configure autoscaling.
For this, we first need to add a Provisioner, see Create Provisioner.
In the Provisioner resource, we describe which EC2 instance types to use; in the providerRef we set the name of an AWSNodeTemplate resource, and with consolidation we enable moving Pods between Nodes to optimize the use of WorkerNodes.
All parameters are in Provisioners – it is very useful to look at them.
Ready-made examples are available in the repository – examples/provisioner.
The AWSNodeTemplate resource describes exactly where to create new Nodes – by the karpenter.sh/discovery=eks-dev-1-26-cluster tag that we set earlier on the SecurityGroups and Subnets.
All parameters for the AWSNodeTemplate can be found on the Node Templates page.
So, here is what we need:
- use only T3 small, medium, or large instances
- place new Nodes only in AvailabilityZone us-east-1a or us-east-1b
Create a manifest:
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
  - key: karpenter.k8s.aws/instance-family
    operator: In
    values: [t3]
  - key: karpenter.k8s.aws/instance-size
    operator: In
    values: [small, medium, large]
  - key: topology.kubernetes.io/zone
    operator: In
    values: [us-east-1a, us-east-1b]
  providerRef:
    name: default
  consolidation:
    enabled: true
  ttlSecondsUntilExpired: 2592000
  ttlSecondsAfterEmpty: 30
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: eks-dev-1-26-cluster
  securityGroupSelector:
    karpenter.sh/discovery: eks-dev-1-26-cluster
Create resources:
[simterm]
$ kk -n dev-karpenter-system-ns apply -f provisioner.yaml
provisioner.karpenter.sh/default created
awsnodetemplate.karpenter.k8s.aws/default created
[/simterm]
Testing autoscaling with Karpenter
To check that everything is working, you can scale down the existing NodeGroup by removing some of its EC2 instances.
In my Kubernetes cluster, we have our monitoring running – let’s break it down a bit 😉
Change the AutoScale Group parameters:
Or create a Deployment with big requests and a high number of replicas:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 50
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: nginx
        resources:
          requests:
            memory: "2048Mi"
            cpu: "1000m"
          limits:
            memory: "2048Mi"
            cpu: "1000m"
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-app
Watch the Karpenter logs – a new instance has been created:
[simterm]
2023-08-18T10:42:11.488Z INFO controller.provisioner computed 4 unready node(s) will fit 21 pod(s) {"commit": "f013f7b"}
2023-08-18T10:42:11.497Z INFO controller.provisioner created machine {"commit": "f013f7b", "provisioner": "default", "machine": "default-p7mnx", "requests": {"cpu":"275m","memory":"360Mi","pods":"9"}, "instance-types": "t3.large, t3.medium, t3.small"}
2023-08-18T10:42:12.335Z DEBUG controller.machine.lifecycle created launch template {"commit": "f013f7b", "machine": "default-p7mnx", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/15949964056112399691", "id": "lt-0288ed1deab8c37a7"}
2023-08-18T10:42:12.368Z DEBUG controller.machine.lifecycle discovered launch template {"commit": "f013f7b", "machine": "default-p7mnx", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/10536660432211978551"}
2023-08-18T10:42:12.402Z DEBUG controller.machine.lifecycle discovered launch template {"commit": "f013f7b", "machine": "default-p7mnx", "provisioner": "default", "launch-template-name": "karpenter.k8s.aws/15491520123601971661"}
2023-08-18T10:42:14.524Z INFO controller.machine.lifecycle launched machine {"commit": "f013f7b", "machine": "default-p7mnx", "provisioner": "default", "provider-id": "aws:///us-east-1b/i-060bca40394a24a62", "instance-type": "t3.small", "zone": "us-east-1b", "capacity-type": "on-demand", "allocatable": {"cpu":"1930m","ephemeral-storage":"17Gi","memory":"1418Mi","pods":"11"}}
[/simterm]
And in a minute check the Nodes in the cluster:
[simterm]
$ kk get node
NAME                         STATUS   ROLES    AGE     VERSION
ip-10-0-2-183.ec2.internal   Ready    <none>   6m34s   v1.26.6-eks-a5565ad
ip-10-0-2-194.ec2.internal   Ready    <none>   19m     v1.26.4-eks-0a21954
ip-10-0-2-212.ec2.internal   Ready    <none>   6m38s   v1.26.6-eks-a5565ad
ip-10-0-3-210.ec2.internal   Ready    <none>   6m38s   v1.26.6-eks-a5565ad
ip-10-0-3-84.ec2.internal    Ready    <none>   6m36s   v1.26.6-eks-a5565ad
ip-10-0-3-95.ec2.internal    Ready    <none>   6m35s   v1.26.6-eks-a5565ad
[/simterm]
Or in the AWS Console, by the karpenter.sh/managed-by tag:
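The same check can be done from the terminal – listing running instances that carry the Karpenter tag (a sketch that requires AWS credentials):

```shell
aws ec2 describe-instances \
    --filters "Name=tag-key,Values=karpenter.sh/managed-by" \
              "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].{ID:InstanceId,Type:InstanceType,AZ:Placement.AvailabilityZone}' \
    --output table
```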
Done.
What is left to be done:
- for the default Node Group, which is created with the cluster by the AWS CDK, add the critical-addons=true tag and taints with the NoExecute and NoSchedule rules – this will be a dedicated group for all the controllers (see Kubernetes: Pods and WorkerNodes – control the placement of the Pods on the Nodes)
- add the Key=karpenter.sh/discovery,Value=${CLUSTER_NAME} tag in the cluster’s automation for WorkerNodes, SecurityGroups, and Private Subnets
- in the chart values for the AWS ALB Controller, ExternalDNS, and Karpenter itself, add tolerations for the critical-addons=true taint with the NoExecute and NoSchedule effects
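A sketch of that last item, assuming the Node Group is tainted with critical-addons=true (the taint key and values are my naming, not a standard): the chart values of each controller would then include something like:

```yaml
# Tolerations to add to the chart values of aws-load-balancer-controller,
# external-dns, Karpenter, etc., so their Pods can run on the tainted Node Group
tolerations:
- key: critical-addons
  operator: Equal
  value: "true"
  effect: NoSchedule
- key: critical-addons
  operator: Equal
  value: "true"
  effect: NoExecute
```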
That’s all for now.
All Pods are up, everything is working.
And a couple of useful commands to check Pod/Node status.
Output the number of Pods on each Node:
[simterm]
$ kubectl get pods -A -o jsonpath='{range .items[?(@.spec.nodeName)]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c | sort -rn
[/simterm]
Display Pods on a Node:
[simterm]
$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-2-212.ec2.internal
[/simterm]
Also, you can add plugins for kubectl to check used resources on Nodes – see Kubernetes: Krew plugin manager and useful plugins for kubectl.
Oh, and it would be good to play around with the Vertical Pod Autoscaler, to see how Karpenter will deal with it.
Useful links
- Getting Started with Karpenter
- Karpenter Best Practices
- Control Pod Density
- Deprovisioning Controller
- How do I install Karpenter in my Amazon EKS cluster?