Terraform: AWS EKS Terraform module update from version 20.x to version 21.

08/06/2025

AWS EKS Terraform module version v21.0.0 added support for the AWS Provider Version 6.

The full list of changes is described in the module's upgrade documentation.

The main change in the AWS EKS module is the replacement of IRSA with EKS Pod Identity in the Karpenter sub-module:

Native support for IAM roles for service accounts (IRSA) has been removed; EKS Pod Identity is now enabled by default

Also, “The `aws-auth` sub-module has been removed“, but I personally removed it a long time ago.

Some variables have also been renamed.

I wrote about upgrading from version 19 to 20 in Terraform: EKS and Karpenter – upgrade module version from 19.21 to 20.0, and this time we will follow the same path – change the module versions and see what breaks.

I have a separate "Testing" environment for this: first I roll it out with the current versions of modules/providers, then update the code, deploy the upgrade, and once everything is fixed, I upgrade EKS Production (because we run a single cluster for dev/staging/prod).

In Karpenter’s own Helm chart, there seem to be no significant changes, although version 1.6 has already been released. You can update it at the same time, but that’s for another time.

Overall, the upgrade went smoothly, but there were two issues that required some debugging: a problem with the EC2 metadata for AWS Load Balancer Controller during the upgrade, and a problem with EKS Add-ons when creating a new cluster with AWS EKS Terraform module v21.x.

Upgrade AWS EKS Terraform module

Upgrade AWS Provider Version 6

First, change the AWS Provider version – finally, because the open pull requests from Renovate were annoying, and I couldn’t close them.

It’s simple – just change the version to 6:

...
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 6.0"
    }
  }
...

Use the pessimistic constraint operator ~> to allow upgrades to any 6.x minor and patch release, but not to 7.0.

Both Renovate and terraform init -upgrade will respect this constraint.

Upgrade terraform-aws-modules/eks/aws

Let’s upgrade the EKS module version – change 20 to 21, also with the “~>“:

...
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> v21.0"
  ...

And Karpenter too – I have it as a separate module:

module "karpenter" {
  source = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> v21.0"
  ...

Run terraform init, and you get a “does not match configured version constraint” error, which I’ve already described in the Terraform: “no available releases match the given constraints” post:

$ terraform init
...
registry.terraform.io/hashicorp/aws 5.100.0 does not match configured version constraint >= 4.0.0, >= 4.36.0, >= 4.47.0, >= 5.0.0, ~> 5.14, >= 6.0.0
...

Because .terraform.lock.hcl still contains the old version of the AWS provider:

$ cat envs/test-1-33/.terraform.lock.hcl | grep -A 5 5.100
  version     = "5.100.0"
  constraints = ">= 4.0.0, >= 4.33.0, >= 4.36.0, >= 4.47.0, >= 5.0.0, ~> 5.14, >= 5.95.0"

You can drop the file and run terraform init again, or you can run terraform init -upgrade to pull all upgrades at once:

$ terraform init -upgrade

Check .terraform.lock.hcl again – now everything is OK:

$ git diff .terraform.lock.hcl
diff --git a/terraform/envs/test-1-33/.terraform.lock.hcl b/terraform/envs/test-1-33/.terraform.lock.hcl
index bd44714..cb2eace 100644
--- a/terraform/envs/test-1-33/.terraform.lock.hcl
+++ b/terraform/envs/test-1-33/.terraform.lock.hcl
@@ -24,98 +24,85 @@ provider "registry.terraform.io/alekc/kubectl" {
 }
 
 provider "registry.terraform.io/hashicorp/aws" {
-  version     = "5.100.0"
-  constraints = ">= 4.0.0, >= 4.33.0, >= 4.36.0, >= 4.47.0, >= 5.0.0, ~> 5.14, >= 5.95.0"
+  version     = "6.7.0"
+  constraints = ">= 4.0.0, >= 4.36.0, >= 4.47.0, >= 5.0.0, >= 6.0.0, ~> 6.0"
   hashes = [
...

Let’s run terraform plan and see what breaks.

Renamed variables in terraform-aws-modules/eks/aws

The first errors, as expected, were about unsupported arguments, because the variables had been renamed in the module:

$ terraform plan -var-file=test-1-33.tfvars
...
│ Error: Unsupported argument
│ 
│   on ../../modules/atlas-eks/eks.tf line 34, in module "eks":
│   34:   cluster_name    = "${var.env_name}-cluster"
│ 
│ An argument named "cluster_name" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│   on ../../modules/atlas-eks/eks.tf line 38, in module "eks":
│   38:   cluster_version = var.eks_version
│ 
│ An argument named "cluster_version" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│   on ../../modules/atlas-eks/eks.tf line 42, in module "eks":
│   42:   cluster_endpoint_public_access = var.eks_params.cluster_endpoint_public_access
│ 
│ An argument named "cluster_endpoint_public_access" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│   on ../../modules/atlas-eks/eks.tf line 46, in module "eks":
│   46:   cluster_enabled_log_types = var.eks_params.cluster_enabled_log_types
│ 
│ An argument named "cluster_enabled_log_types" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│   on ../../modules/atlas-eks/eks.tf line 50, in module "eks":
│   50:   cluster_addons = {
│ 
│ An argument named "cluster_addons" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│   on ../../modules/atlas-eks/eks.tf line 148, in module "eks":
│  148:   cluster_security_group_name = "${var.env_name}-cluster-sg"
│ 
│ An argument named "cluster_security_group_name" is not expected here.
...

Let’s go to the upgrade documentation and find out what the variables are now called:

  • cluster_name => name
  • cluster_version => kubernetes_version
  • cluster_endpoint_public_access => endpoint_public_access
  • cluster_enabled_log_types => enabled_log_types
  • cluster_addons => addons
  • cluster_security_group_name => security_group_name

Although, in my opinion, keeping the cluster_* prefix would have been better: there was node_security_group_name and cluster_security_group_name, so it was clear which parameter was for what.

And now there is node_security_group_name and “some” security_group_name.
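To make it concrete, here is a minimal sketch of my module "eks" call after the renames – the values are the same ones shown in the plan errors above:

...
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> v21.0"

  # was: cluster_name
  name = "${var.env_name}-cluster"
  # was: cluster_version
  kubernetes_version = var.eks_version
  # was: cluster_endpoint_public_access
  endpoint_public_access = var.eks_params.cluster_endpoint_public_access
  # was: cluster_enabled_log_types
  enabled_log_types = var.eks_params.cluster_enabled_log_types
  # was: cluster_security_group_name
  security_group_name = "${var.env_name}-cluster-sg"

  # was: cluster_addons
  addons = {
    ...
  }
  ...
...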

Removed variables in terraform-aws-modules/eks/aws//modules/karpenter

OK, edit the variable names in the main module code and run terraform plan again – now we get errors for the changes in the karpenter module:

...
 Error: Unsupported argument
│ 
│   on ../../modules/atlas-eks/karpenter.tf line 7, in module "karpenter":
│    7:   irsa_oidc_provider_arn          = module.eks.oidc_provider_arn
│ 
│ An argument named "irsa_oidc_provider_arn" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│   on ../../modules/atlas-eks/karpenter.tf line 8, in module "karpenter":
│    8:   irsa_namespace_service_accounts = ["karpenter:karpenter"]
│ 
│ An argument named "irsa_namespace_service_accounts" is not expected here.
╵
╷
│ Error: Unsupported argument
│ 
│   on ../../modules/atlas-eks/karpenter.tf line 14, in module "karpenter":
│   14:   enable_irsa             = true
│ 
│ An argument named "enable_irsa" is not expected here.

...

They were removed because IRSA no longer exists—an EKS Pod Identity will now be created for Karpenter, see main.tf#L92.

I wrote about EKS Pod Identities in AWS: EKS Pod Identities – a replacement for IRSA? Simplifying IAM access management and in Terraform: managing EKS Access Entries and EKS Pod Identities.

Let’s remove them:

...
  #irsa_oidc_provider_arn          = module.eks.oidc_provider_arn
  #irsa_namespace_service_accounts = ["karpenter:karpenter"]
  #enable_irsa             = true
...

Run terraform plan again.

Important: Karpenter’s EKS Pod Identity Namespace

And here is an important point:

...
  # module.atlas_eks.module.karpenter.aws_eks_pod_identity_association.karpenter[0] will be created
  + resource "aws_eks_pod_identity_association" "karpenter" {
      ...
      + namespace            = "kube-system"
      + region               = "us-east-1"
      + role_arn             = "arn:aws:iam::492***148:role/KarpenterIRSA-atlas-eks-test-1-33-cluster"
      + service_account      = "karpenter"
...

The aws_eks_pod_identity_association will be created for the kube-system Kubernetes Namespace.

If you have Karpenter running in a different namespace, you need to specify it explicitly when calling the module:

...
module "karpenter" {
  source = "terraform-aws-modules/eks/aws//modules/karpenter"
  version = "~> v21.0"

  cluster_name = module.eks.cluster_name
  namespace  = "karpenter"
...

Otherwise, Karpenter will break, and the WorkerNode Group upgrade will fail because the Node will be waiting for the Karpenter Pod, which will be stuck in CrashLoopBackOff.
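Once the changes are applied, the Namespace of the association can be double-checked with the AWS CLI – the cluster name here is the one from my Testing environment:

$ aws eks list-pod-identity-associations --cluster-name atlas-eks-test-1-33-cluster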

eks_managed_node_groups: attribute “taints”: map of object required

Now there is an error with node group taints:

...
│ The given value is not suitable for module.atlas_eks.module.eks.var.eks_managed_node_groups declared at .terraform/modules/atlas_eks.eks/variables.tf:1205,1-35: element "test-1-33-default": attribute "taints": map of object required.
...

Why? Because:

Variable definitions now contain detailed object types in place of the previously used any type.

See diff 20 vs 21:

So now it should be map(object):

...
  type = map(object({
    key    = string
    value  = optional(string)
    effect = string
  }))
...

And my taints are currently passed from a variable typed as set(map(string)):

...
variable "eks_managed_node_group_params" {
  description = "EKS Managed NodeGroups setting, one item in the map() per each dedicated NodeGroup"
  type = map(object({
    min_size                   = number
    max_size                   = number
    desired_size               = number
    instance_types             = list(string)
    capacity_type              = string
    taints                     = set(map(string))
    max_unavailable_percentage = number
  }))
}
...

With the following values:

...
eks_managed_node_group_params = {
  default_group = {
    min_size       = 1
    max_size       = 1
    desired_size   = 1
    instance_types = ["t3.medium"]
    capacity_type  = "ON_DEMAND"
    taints = [
      {
        key    = "CriticalAddonsOnly"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
      {
        key    = "CriticalAddonsOnly"
        value  = "true"
        effect = "NO_EXECUTE"
      }
    ]
    max_unavailable_percentage = 100
  }
}
...

So what needs to be done is to change the declaration of the variable in my code:

...
variable "eks_managed_node_group_params" {
  description = "EKS Managed NodeGroups setting, one item in the map() per each dedicated NodeGroup"
  type = map(object({
    min_size                   = number
    max_size                   = number
    desired_size               = number
    instance_types             = list(string)
    capacity_type              = string
    #taints                     = set(map(string))
    taints = optional(map(object({
      key    = string
      value  = optional(string)
      effect = string
    })))
    max_unavailable_percentage = number
  }))
}
...
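Note: optional() attributes in object type constraints require Terraform 1.3 or newer.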

And update the values – add keys to turn the list into a map:

...
eks_managed_node_group_params = {
  default_group = {
    min_size       = 1
    max_size       = 1
    desired_size   = 1
    instance_types = ["t3.medium"]
    capacity_type  = "ON_DEMAND"
    # taints = [
    #   {
    #     key    = "CriticalAddonsOnly"
    #     value  = "true"
    #     effect = "NO_SCHEDULE"
    #   },
    #   {
    #     key    = "CriticalAddonsOnly"
    #     value  = "true"
    #     effect = "NO_EXECUTE"
    #   }
    # ]
    taints = {
      critical_no_sched = {
        key    = "CriticalAddonsOnly"
        value  = "true"
        effect = "NO_SCHEDULE"
      },
      critical_no_exec = {
        key    = "CriticalAddonsOnly"
        value  = "true"
        effect = "NO_EXECUTE"
      }
    }
    max_unavailable_percentage = 100
  }
}
...

Run terraform plan again, and now everything works without errors.

Let’s deploy the updates.

Deploying changes

Run terraform apply, and now we have a new resource with EKS Pod Identity Association for Karpenter – module.atlas_eks.module.karpenter.aws_eks_pod_identity_association.karpenter:

It wasn’t there in the old cluster created with v20.

ALB Controller error: “failed to fetch VPC ID from instance metadata”

There was also a problem with the AWS Load Balancer Controller: after the upgrade it could not reach the EC2 Instance Metadata Service, probably due to the switch to IMDSv2, see AWS: Instance Metadata Service v1 vs IMDS v2 and working with Kubernetes Pod and Docker containers:

...
{"level":"error","ts":"2025-08-06T07:25:40Z"," logger":"setup","msg":"unable to initialize AWS cloud","error":"failed to get VPC ID: failed to fetch VPC ID from instance metadata: error in fetching vpc id through ec2 metadata: get mac metadata: operation error ec2imds: GetMetadata, canceled, context deadline exceeded"}
...

Actually, we can just pass the VPC ID and AWS Region to the controller explicitly; see the documentation Using the Amazon EC2 instance metadata server version 2 (IMDSv2).

Note the --aws-vpc-tag-key:

optional flag --aws-vpc-tag-key if you have a different key for the tag other than “Name”

First, let’s try setting the parameters manually to check that it works:
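A quick way to do that is to add the flags directly to the controller’s Deployment args – a sketch assuming the chart’s default Deployment name and namespace, with a placeholder VPC ID:

$ kubectl -n kube-system edit deployment aws-load-balancer-controller
...
    spec:
      containers:
      - args:
        - --cluster-name=atlas-eks-test-1-33-cluster
        - --aws-region=us-east-1
        # a placeholder – set your own VPC ID here
        - --aws-vpc-id=vpc-xxxxxxxxxxxx
...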

Everything is working now.

Now the same parameters for the Helm chart, see its values.yaml#L163 – my controllers are installed with aws-ia/eks-blueprints-addons/aws in Terraform when the cluster is created, so I set them there:

...
    values = [
      <<-EOT
        replicaCount: 1
        region: ${var.aws_region}
        vpcId: ${var.vpc_id}
        tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
      EOT
    ]
...
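For context, in my case this values list sits inside the aws_load_balancer_controller entry of the aws-ia/eks-blueprints-addons module call – a rough sketch, with the rest of the module’s inputs omitted:

...
module "eks_blueprints_addons" {
  source = "aws-ia/eks-blueprints-addons/aws"
  ...

  enable_aws_load_balancer_controller = true
  aws_load_balancer_controller = {
    values = [
      <<-EOT
        replicaCount: 1
        region: ${var.aws_region}
        vpcId: ${var.vpc_id}
        tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
      EOT
    ]
  }
}
...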

Start the deployment:

Everything works.

Issue: Node Group Status CREATE_FAILED

Here I will describe a problem that arose only when creating a new EKS cluster with module v21 – upgrading an existing cluster proceeds without these issues.

Actually, here’s the problem: the cluster was created and everything seems OK, but it hangs for a long time on creating the Node Group, and then fails with the error “unexpected state ‘CREATE_FAILED’“:

...
╷
│ Error: waiting for EKS Node Group (atlas-eks-test-1-33-cluster:test-1-33-default-20250801112636765600000014) create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: i-03f2c73c7211880f7: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster
...

Although the EC2 Auto Scaling Group is created, and it has an EC2 instance up and running.

Why?

So the problem is that the WorkerNode has been created but cannot join the Kubernetes cluster.

The first thing that comes to mind is to check the Security Group, but all the rules there are correct. I compared it with the current EKS cluster, which was created with the AWS EKS Terraform module v20.x – everything is the same.

Problem with IAM? EC2 doesn’t have permissions to access the cluster? Again, compare with the old cluster, and everything is OK.

“Check the logs, Billy!”

The funny thing is that SSH is configured on my EC2 instances, but only on the nodes created by Karpenter, as I wrote in AWS: Karpenter and SSH for Kubernetes WorkerNodes.

The current problem arose in the “default” NodeGroup, where various controllers are launched.

So, let’s connect via the AWS Console and select Connect:

Then, in EC2 Instance Connect, select “Connect using a Private IP” and select an existing EC2 Instance Connect Endpoint or quickly create a new one.

Set the username – for Amazon Linux, it is ec2-user:
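Alternatively, the same connection can be made from the terminal through an EC2 Instance Connect Endpoint – the instance ID here is the one from the CREATE_FAILED error above:

$ aws ec2-instance-connect ssh --instance-id i-03f2c73c7211880f7 --os-user ec2-user --connection-type eice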

And let’s look at the logs:
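On Amazon Linux nodes, kubelet runs as a systemd unit, so its logs can be read with journalctl from the SSH session:

[ec2-user@ip-10-0-48-198 ~]$ sudo journalctl -u kubelet --no-pager | tail -50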

“Container runtime network not ready – cni plugin not initialized”

Actually:

Aug 01 13:26:04 ip-10-0-48-198.ec2.internal kubelet[1619]: E0801 13:26:04.989799    1619 kubelet.go:3126] "Container runtime network not ready" networkReady="NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized"

Wow…

Okay, what’s the situation with VPC CNI?

Let’s go check out EKS Add-ons, and…

It’s completely empty.
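The same can be confirmed with the AWS CLI – the list of add-ons for the cluster comes back empty:

$ aws eks list-addons --cluster-name atlas-eks-test-1-33-cluster
{
    "addons": []
}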

Let’s look at the terraform apply log – we see “Read complete“, but there is no “Creating…“:

...
module.atlas_eks.module.eks.data.aws_eks_addon_version.this["vpc-cni"]: Read complete after 0s [id=vpc-cni]
...

Let’s check if there are any containers on the node – maybe there are some errors?
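On EKS nodes with containerd, this can be checked with crictl from the same SSH session:

[ec2-user@ip-10-0-48-198 ~]$ sudo crictl ps -a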

Wow, once again…

Nothing at all.

So I went to the module’s GitHub Issues, searched for “addon”, and found this issue: Managed EKS Node Groups boot without CNI, but addon is added after node group.

Actually, yes – the problem arose due to the absence of the before_compute parameter.

Although it’s a little strange, because the parameter was added back in v19.9, and the last time I deployed a cluster from scratch with v20, this problem did not occur.

Moreover, when I created a Testing cluster from the master branch, where none of the updates described here had been applied and module version v20 is still used, everything worked without any problems.

And in diff 20 vs 21 I don’t see any significant changes related to before_compute.

However, since this only applies to creating a new cluster, we do not need to add before_compute when simply upgrading. But if you do add it, the add-ons will be recreated.

The before_compute parameter itself was added to allow specifying which add-ons to create before the WorkerNodes and which after. See main.tf#L797 and the comments to PR #2478.

Add it as in the EKS Managed Node Group example:

...
    vpc-cni = {
      addon_version  = var.eks_addon_versions.vpc_cni
      before_compute = true
      configuration_values = jsonencode({
        env = {
          ENABLE_PREFIX_DELEGATION     = "true"
          WARM_PREFIX_TARGET           = "1"
          AWS_VPC_K8S_CNI_EXTERNALSNAT = "true"
        }
      })
    }
    aws-ebs-csi-driver = {
      addon_version            = var.eks_addon_versions.aws_ebs_csi_driver
      service_account_role_arn = module.ebs_csi_irsa_role.iam_role_arn
    }
    eks-pod-identity-agent = {
      addon_version  = var.eks_addon_versions.eks_pod_identity_agent
      before_compute = true
    }
...

Run terraform apply again, and here it is:

...
module.atlas_eks.module.eks.aws_eks_addon.before_compute["vpc-cni"]: Creating...
...
module.atlas_eks.module.eks.aws_eks_addon.before_compute["vpc-cni"]: Creation complete after 46s [id=atlas-eks-test-1-33-cluster:vpc-cni]
...

And in the AWS Console:

NodeGroup created without errors:

...
module.atlas_eks.module.eks.module.eks_managed_node_group["test-1-33-default"].aws_eks_node_group.this[0]: Still creating... [01m40s elapsed]
module.atlas_eks.module.eks.module.eks_managed_node_group["test-1-33-default"].aws_eks_node_group.this[0]: Creation complete after 1m49s [id=atlas-eks-test-1-33-cluster:test-1-33-default-20250801142042855800000003]
...

Done.