On the Dev Elastic Kubernetes Service cluster, several namespaces got stuck during deletion – they stay in the Terminating state.
“401 Unauthorized”, response: “Unauthorized”
Remembering a similar case where the culprit turned out to be metrics-server (see Kubernetes: namespace stuck in Terminating, and the non-obvious metrics-server issue), the first thing to do is check its logs:
[simterm]
$ kk -n kube-system logs -f metrics-server-5f956b6d5f-r7v8f
...
E0416 11:54:47.022378 1 manager.go:111] unable to fully collect metrics: unable to fully scrape metrics from source kubelet_summary:ip-10-21-39-158.us-east-2.compute.internal: unable to fetch metrics from Kubelet ip-10-21-39-158.us-east-2.compute.internal (ip-10-21-39-158.us-east-2.compute.internal): request failed - "401 Unauthorized", response: "Unauthorized"
...
[/simterm]
request failed – “401 Unauthorized”, response: “Unauthorized” – aha.
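An unavailable metrics-server APIService is exactly what can keep namespaces hanging in Terminating (see the post linked above), so it is also worth checking its state – a quick extra check, not part of the original steps:
[simterm]
$ kubectl get apiservice v1beta1.metrics.k8s.io
[/simterm]
If AVAILABLE is False here, namespace finalization cannot pass the API discovery check and deletion hangs.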
And kubectl top node for this node does not work either:
[simterm]
$ kubectl top node
NAME                                         CPU(cores)   CPU%        MEMORY(bytes)   MEMORY%
...
ip-10-21-39-158.us-east-2.compute.internal   <unknown>    <unknown>   <unknown>       <unknown>
[/simterm]
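The <unknown> values suggest the node has stopped reporting its status at all; to confirm, one can look at the node conditions directly (an extra check, using the same node name):
[simterm]
$ kubectl describe node ip-10-21-39-158.us-east-2.compute.internal | grep -A 10 'Conditions:'
[/simterm]
A Ready condition with Status Unknown here points at the kubelet itself rather than at metrics-server.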
Kubelet stopped posting node status
Let's check the kube-controller-manager logs in AWS CloudWatch:
node_lifecycle_controller.go:1127] node ip-10-21-39-158.us-east-2.compute.internal hasn’t been updated for 666h59m43.490707998s. Last Ready is: &NodeCondition{Type:Ready,Status:Unknown,LastHeartbeatTime:2021-03-11 06:02:06 +0000 UTC,LastTransitionTime:2021-03-11 05:51:24 +0000 UTC,Reason:NodeStatusUnknown,Message:Kubelet stopped posting node status.,}
The error is “Reason:NodeStatusUnknown, Message:Kubelet stopped posting node status”.
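The same entries can be pulled with the AWS CLI, assuming controllerManager control plane logging is enabled for the cluster; the log group name below is an assumption based on the standard /aws/eks/<cluster-name>/cluster layout:
[simterm]
$ aws logs filter-log-events \
    --log-group-name /aws/eks/bttrm-eks-dev-1-18/cluster \
    --log-stream-name-prefix kube-controller-manager \
    --filter-pattern '"stopped posting node status"' \
    --query 'events[].message' --output text
[/simterm]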
Let's check the Prometheus metric – kube_node_status_condition{condition="Ready",status="true",env=~".*dev.*"} == 0:
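The same expression can be checked from the command line against the Prometheus HTTP API (the Prometheus address here is just a placeholder):
[simterm]
$ curl -s 'http://prometheus.example.com:9090/api/v1/query' \
    --data-urlencode 'query=kube_node_status_condition{condition="Ready",status="true",env=~".*dev.*"} == 0' \
    | jq '.data.result'
[/simterm]
A non-empty result means the node has been non-Ready from Prometheus' point of view as well.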
We have an alert for exactly this kind of case:
- alert: KubernetesNodeNotReady
  expr: kube_node_status_condition{condition="Ready", status="true", env=~".*prod.*"} == 0
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Kubernetes node is not in the Ready state"
    description: "*Kubernetes cluster*: `{{ $labels.ekscluster }}`\nNode `{{ $labels.node }}`\nEC2 instance: `{{ $labels.instance }}` has been unready for a long time"
    tags: kubernetes, aws
But it only works for Production – the expression matches env=~".*prod.*", so nothing fired for the Dev cluster.
kubelet: use of closed network connection
Log in to the WorkerNode over SSH and check the kubelet logs:
[simterm]
[root@ip-10-21-39-158 ec2-user]# journalctl -u kubelet
...
Mar 18 11:27:25 ip-10-21-39-158.us-east-2.compute.internal kubelet[584]: E0318 11:27:25.616937 584 webhook.go:111] Failed to make webhook authenticator request: Post https://676***892.gr7.us-east-2.eks.amazonaws.com/apis/authentication.k8s.io/v1/tokenreviews: write tcp 10.21.39.158:33930->10.21.40.129:443: use of closed network connection
Mar 18 11:27:25 ip-10-21-39-158.us-east-2.compute.internal kubelet[584]: E0318 11:27:25.617339 584 server.go:263] Unable to authenticate the request due to an error: Post https://676***892.gr7.us-east-2.eks.amazonaws.com/apis/authentication.k8s.io/v1/tokenreviews: write tcp 10.21.39.158:33930->10.21.40.129:443: use of closed network connection
...
[/simterm]
Googling “kubelet use of closed network connection” turns up plenty of discussions, for example here>>>, but instead of digging into them, let's just try restarting kubelet:
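On the EKS Amazon Linux worker AMI, restarting kubelet is normally just:
[simterm]
[root@ip-10-21-39-158 ec2-user]# systemctl restart kubelet   # the assumed restart step
[/simterm]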
AWS was not able to validate the provided access credentials
But even after the restart, kubelet refused to start, now failing with “AWS was not able to validate the provided access credentials”:
[simterm]
...
Apr 19 10:06:23 ip-10-21-39-158.us-east-2.compute.internal kubelet[19492]: I0419 10:06:23.691844 19492 aws.go:1289] Building AWS cloudprovider
Apr 19 10:06:23 ip-10-21-39-158.us-east-2.compute.internal kubelet[19492]: F0419 10:06:23.740946 19492 server.go:274] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-04e759611b9075cd2: "error listing AWS instances: \"AuthFailure: AWS was not able to validate the provided access credentials\\n\\tstatus code: 401, request id: 81dc17d9-ceae-47b9-ba74-6903a9a1be87\""
...
[/simterm]
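The AuthFailure hints that the node's IAM credentials have gone stale; before rebooting, one could also check what the instance actually gets from the metadata service (an extra diagnostic; <role-name> is a placeholder for the profile name returned by the first call, and with IMDSv2 enforced a session token would be needed first):
[simterm]
[root@ip-10-21-39-158 ec2-user]# curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
[root@ip-10-21-39-158 ec2-user]# curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>
[/simterm]
The second call returns a JSON document with AccessKeyId and Expiration – expired or missing credentials would explain the 401 from the EC2 API.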
Welp…
Let's reboot the whole instance:
[simterm]
[root@ip-10-21-39-158 ec2-user]# reboot
[/simterm]
And after the reboot everything came back up:
[simterm]
...
Apr 19 09:05:32 ip-10-21-39-158.us-east-2.compute.internal kubelet[3901]: I0419 09:05:32.667289 3901 aws.go:1289] Building AWS cloudprovider
Apr 19 09:05:32 ip-10-21-39-158.us-east-2.compute.internal kubelet[3901]: I0419 09:05:32.827269 3901 tags.go:79] AWS cloud filtering on ClusterID: bttrm-eks-dev-1-18
Apr 19 09:05:32 ip-10-21-39-158.us-east-2.compute.internal kubelet[3901]: I0419 09:05:32.836393 3901 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/eksctl/ca.crt
Apr 19 09:05:32 ip-10-21-39-158.us-east-2.compute.internal kubelet[3901]: I0419 09:05:32.921831 3901 server.go:647] --cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /
...
[/simterm]
Check kubectl top node once again:
[simterm]
$ kubectl top node ip-10-21-39-158.us-east-2.compute.internal
NAME                                         CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-21-39-158.us-east-2.compute.internal   102m         5%     502Mi           15%
[/simterm]
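And since metrics-server can scrape the node again, the namespaces that were stuck in Terminating should now finish deleting – this can be verified with something like:
[simterm]
$ kubectl get namespaces | grep Terminating
[/simterm]
Empty output means nothing is stuck anymore.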
Done.