We use the Kubernetes Cluster Autoscaler to scale AWS EC2 WorkerNode groups.
On our Dev cluster, it sometimes stops working with the following messages in its logs:
[simterm]
...
E0331 08:57:52.264549 1 leaderelection.go:320] error retrieving resource lock kube-system/cluster-autoscaler: Get https://172.20.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cluster-autoscaler: context deadline exceeded
I0331 08:58:14.468096 1 leaderelection.go:277] failed to renew lease kube-system/cluster-autoscaler: timed out waiting for the condition
F0331 08:58:25.568173 1 main.go:428] lost master
...
[/simterm]
Its pod also keeps restarting:
[simterm]
$ kubectl -n kube-system get pod | grep cluster
cluster-autoscaler-864bcb77d7-p5nlv 0/1 Error 261 21d
[/simterm]
Check the available options on the "What are the parameters to CA?" page.
We run it as a single pod, so there is no other replica to elect a leader among. Let’s disable leader election by adding the --leader-elect=false option to the command field:
...
command:
  - ./cluster-autoscaler
  - --v=4
  - --stderrthreshold=info
  - --cloud-provider=aws
  - --skip-nodes-with-local-storage=false
  - --expander=least-waste
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/{{ eks_cluster_name }}
  - --balance-similar-node-groups
  - --skip-nodes-with-system-pods=false
  - --leader-elect=false
...
Re-deploy it, and now everything is working properly.
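To confirm that the restarts have actually stopped, it can help to keep an eye on the RESTARTS column for a while after the re-deploy. Below is a minimal sketch: `ca_restarts` is a hypothetical helper name (not part of kubectl or the Cluster Autoscaler), and it assumes the pod name starts with `cluster-autoscaler-` and that RESTARTS is the fourth column, as in the output shown earlier.

```shell
# Hypothetical helper: reads `kubectl get pod` output on stdin and prints
# the name and restart count of every cluster-autoscaler pod.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
ca_restarts() {
  awk '$1 ~ /^cluster-autoscaler-/ { print $1, $4 }'
}

# Usage against a live cluster (assumes kubectl is configured for it):
#   kubectl -n kube-system get pod --no-headers | ca_restarts
```

If the counter stays at the same value over the next few hours, the pod is no longer exiting on a lost lease.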