Kubernetes: Cluster Autoscaler – failed to renew lease

By | 04/07/2021
 

We run the Kubernetes Cluster Autoscaler to scale our AWS EC2 WorkerNode groups.

On our Dev cluster, it sometimes stops working with the following messages in its logs:

...
E0331 08:57:52.264549       1 leaderelection.go:320] error retrieving resource lock kube-system/cluster-autoscaler: Get https://172.20.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler: context deadline exceeded
I0331 08:58:14.468096       1 leaderelection.go:277] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0331 08:58:25.568173       1 main.go:428] lost master
...
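The resource lock named in the log is a Lease object in the kube-system namespace, so its current holder and last renewal time can be read directly. A minimal sketch; the function name ca_lease_status is hypothetical, while the lease name and namespace are taken from the log line above:

```shell
# Show who holds the cluster-autoscaler lease and when it was last renewed.
# Lease name and namespace come from the error message in the log.
ca_lease_status() {
  kubectl -n kube-system get lease cluster-autoscaler \
    -o jsonpath='{.spec.holderIdentity}{"\t"}{.spec.renewTime}{"\n"}'
}
```

Run `ca_lease_status` a few times: if the renew time stops advancing while the pod is still running, the autoscaler is probably failing to reach the API server in time.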

Its pod also keeps restarting:

kubectl -n kube-system get pod | grep cluster
cluster-autoscaler-864bcb77d7-p5nlv       0/1     Error     261        21d
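To confirm that the restarts are caused by the lost lease and not by something else, it can help to read the logs of the previous (crashed) container. A minimal sketch; the function name ca_lease_errors is hypothetical, and the grep pattern simply matches the log lines shown earlier:

```shell
# Print leader-election related lines from the previous (crashed) container.
# Usage: ca_lease_errors <pod-name>
ca_lease_errors() {
  pod="$1"
  # --previous reads the logs of the last terminated container instance
  kubectl -n kube-system logs "$pod" --previous \
    | grep -E 'leaderelection\.go|lost master'
}
```

For the pod above this would be `ca_lease_errors cluster-autoscaler-864bcb77d7-p5nlv`.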

The available options are listed in the "What are the parameters to CA?" section of the Cluster Autoscaler FAQ.

We run it as a single pod, so leader election is not needed. Let’s disable it by adding the --leader-elect=false option to the command field:

...
          command:
            - ./cluster-autoscaler
            - --v=4
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/{{ eks_cluster_name }}
            - --balance-similar-node-groups
            - --skip-nodes-with-system-pods=false
            - --leader-elect=false
...

Re-deploy it. The autoscaler no longer tries to renew the lease, and now everything is working properly.
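A quick way to check that the new option was actually rolled out is sketched below. It assumes the Deployment is named cluster-autoscaler (inferred from the pod name, not stated explicitly) and that the autoscaler is the first container in the pod; the function name is hypothetical:

```shell
# Wait for the rollout to finish, then confirm the flag is in the container command.
# Assumes a Deployment named "cluster-autoscaler" with the autoscaler as container 0.
ca_verify_leader_elect_off() {
  kubectl -n kube-system rollout status deployment/cluster-autoscaler --timeout=120s || return 1
  kubectl -n kube-system get deployment cluster-autoscaler \
    -o jsonpath='{.spec.template.spec.containers[0].command}' \
    | grep -q -- '--leader-elect=false'
}
```

`ca_verify_leader_elect_off && echo "leader election disabled"` prints the message only if the rollout completed and the flag is present.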