One day I wondered – how is load balancing between pods is working in Kubernetes?
I.e. – we have an external Load Balancer. Then a Service. And behind it – Pods.
What happens when we are receiving a network packet from the world, and we have a few pods – how the traffic will be distributed between them?
Contents
kube-proxy
So. the rouning rules between pods between a Service and its Pods are controlled by the kube-proxy
service that can be working in one of the three following modes – user space proxy mode, iptables proxy mode, and IPVS proxy mode.
User space proxy mode
Links:
A deprecated mode, previously was the default.
When using this mode, kube-proxy
watch for changes in a cluster and for each new Service will open a TCP port on a WorkerNode.
Then, iptables
on this WorkerNode begins routing traffic from this port to the kube-proxy
service which is acting as a proxy service using the round-robin approach, i.e. by sending traffic to a next pod in its backend’s list. During this, kube-proxy
can try to send a package to another pod if the first one didn’t respond.
iptables proxy mode
Links:
Our case, which we will investigate in this post. Currently, is the default one.
When using this mode, kube-proxy
watch for changes in a cluster and for each new Service will open a TCP port on a WorkerNode.
Then, iptables
on this WorkerNode sends traffic from this port to a Kubernetes Service which is actually a chain in the iptables
rules, and via this chain, traffic goes directly to pods which are a backend for this Service. During this, a targeting pod is selected randomly.
This mode is less expensive for system resources as all necessary operations are performed in the kernel by the netfilter
module. Also, this mode works faster and is more reliable because there is no a “middle-ware” – the kube-proxy
itself.
But if the first pod where a packet was sent did not respond – then a connection fails, while in the user space proxy mode kube-proxy
will try to send it to another pod.
This is why is so important to properly configure Readiness Probes, so Kubernetes will not send traffic to pods that are not ready to accept it.
Furthermore, this mode is more complicated for debugging, because in the user space proxy mode the kube-proxy
will write its logs to the /var/log/kube-proxy
, while with the netfilter
you have to go to debug into the kernel itself.
IPVS proxy mode
Links:
And the most recent mode. It uses the netlink
kernel module and creates new IPVS rules for new Kubernetes Services.
The main is the diversity of the load-balaning modes:
rr
: round-robinlc
: least connection (smallest number of open connections)dh
: destination hashingsh
: source hashingsed
: shortest expected delaynq
: never queue
kube-proxy
config
Let’s go to check which mode is used in our case, in the AWS Lastic Kubernetes Service cluster.
Findkube-proxy
pods:
[simterm]
$ kubectl -n kube-system get pod -l k8s-app=kube-proxy -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES kube-proxy-4prtt 1/1 Running 1 158d 10.3.42.245 ip-10-3-42-245.us-east-2.compute.internal <none> <none> kube-proxy-5b7pd 1/1 Running 0 60d 10.3.58.133 ip-10-3-58-133.us-east-2.compute.internal <none> <none> kube-proxy-66cm5 1/1 Running 0 92d 10.3.58.193 ip-10-3-58-193.us-east-2.compute.internal <none> <none> kube-proxy-8fdsv 1/1 Running 0 70d 10.3.39.145 ip-10-3-39-145.us-east-2.compute.internal <none> <none> kube-proxy-8wbj2 1/1 Running 1 158d 10.3.49.200 ip-10-3-49-200.us-east-2.compute.internal <none> <none> kube-proxy-cnd9c 1/1 Running 1 158d 10.3.47.58 ip-10-3-47-58.us-east-2.compute.internal <none> <none> kube-proxy-cwppt 1/1 Running 0 158d 10.3.48.124 ip-10-3-48-124.us-east-2.compute.internal <none> <none> kube-proxy-dd75p 1/1 Running 1 158d 10.3.43.168 ip-10-3-43-168.us-east-2.compute.internal <none> <none> kube-proxy-p6hb7 1/1 Running 0 158d 10.3.46.137 ip-10-3-46-137.us-east-2.compute.internal <none> <none> kube-proxy-pfjzt 1/1 Running 0 59d 10.3.62.200 ip-10-3-62-200.us-east-2.compute.internal <none> <none> kube-proxy-spckd 1/1 Running 0 70d 10.3.44.14 ip-10-3-44-14.us-east-2.compute.internal <none> <none> kube-proxy-tgl52 1/1 Running 0 59d 10.3.59.159 ip-10-3-59-159.us-east-2.compute.internal <none> <none>
[/simterm]
On every WorkerNode of the cluster, we have a dedicated kube-proxy
instance with the kube-proxy-config ConfigMap attached:
[simterm]
$ kk -n kube-system get pod kube-proxy-4prtt -o yaml apiVersion: v1 kind: Pod ... spec: ... containers: ... volumeMounts: ... - mountPath: /var/lib/kube-proxy-config/ name: config ... volumes: ... - configMap: defaultMode: 420 name: kube-proxy-config name: config
[/simterm]
Check this ConfigMap content:
[simterm]
$ kk -n kube-system get cm kube-proxy-config -o yaml apiVersion: v1 data: config: |- ... mode: "iptables" ...
[/simterm]
Now, when we are more familiar with the kube-proxy
modes – let’s go deeper to see how it works and what iptables
is doing here.
Kubernetes Pod load-balancing
For our journey, let’s take a real application with an Ingress (AWS Application Load Balancer, ALB) which sends traffic to a Kubernetes Service:
[simterm]
$ kk -n eks-dev-1-appname-ns get ingress appname-backend-ingress -o yaml ... - backend: serviceName: appname-backend-svc servicePort: 80 ...
[/simterm]
Check the Service itself:
[simterm]
$ kk -n eks-dev-1-appname-ns get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE appname-backend-svc NodePort 172.20.249.22 <none> 80:31103/TCP 63d
[/simterm]
Here we have the NodePort
type – it’s listening to a TCP port on a WorkerNode.
The ALB e172ad3e-eksdev1appname-abac ALB routes the traffic from clients to the e172ad3e-4caa286edf23ff7e06d AWS TargetGroup:
ЕС2 in this TargetGroup are listening to the 31103 port which we saw in the Service details above:
AWS LoadBalancer traffic modes
A side note: AWS ALB supports two modes for traffic – IP, and Instance Mode.
- Instance mode: the default mode, requires a Kubernetes Service to have the NodePort type, and routes traffic to a TCP port of a WorkerNode
- IP mode: with this mode targets for an ALB are Kubernetes Pods directly instead of the Kubernetes Worker Nodes.
Now, we need to have an access to one of these nodes – connect to a Bastion host, and then to one of the WorkerNode:
[simterm]
ubuntu@ip-10-3-29-14:~$ ssh [email protected] -i .ssh/bttrm-eks-nodegroup-us-east-2.pem Last login: Thu May 28 06:25:27 2020 from ip-10-3-29-32.us-east-2.compute.internal __| __|_ ) _| ( / Amazon Linux 2 AMI ___|\___|___| https://aws.amazon.com/amazon-linux-2/ 39 package(s) needed for security, out of 64 available Run "sudo yum update" to apply all updates. [ec2-user@ip-10-3-49-200 ~]$ sudo -s [root@ip-10-3-49-200 ec2-user]#
[/simterm]
kube-proxy
and iptables
So, a packet from a client came to the WorkerNode.
On this node the kube-proxy
service is binding on the port allocated so no one another service will use it, and also it creates a set of iptables
rules:
[simterm]
[root@ip-10-3-49-200 ec2-user]# netstat -anp | grep 31103 tcp6 0 0 :::31103 :::* LISTEN 4287/kube-proxy
[/simterm]
The packet comes to the 31107 port, where it’s started following by the iptables
filters.
iptables
rules
Links:
- Traversing of tables and chains
- Packet flow in Netfilter and General Networking
- A Deep Dive into Iptables and Netfilter Architecture
The Linux kernel accepts this packet and sends it to the PREROUTING
chain of the nat
table:
See the describing kube-proxy iptables rules.
Check rules in the nat
table and its PREROUTING
chain:
[simterm]
[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L PREROUTING | column -t Chain PREROUTING (policy ACCEPT) target prot opt source destination KUBE-SERVICES all -- anywhere anywhere /* kubernetes service portals */
[/simterm]
Here we have a target
to the following chain – the KUBE-SERVICES
, which have the next chain – KUBE-NODEPORTS
as the last rule, which captures packets for a Service with the NodePort
type:
[simterm]
[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SERVICES -n | column -t ... KUBE-NODEPORTS all -- 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in th is chain */ ADDRTYPE match dst-type LOCAL ...
[/simterm]
Check rules in this chain:
[simterm]
[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-NODEPORTS -n | column -t | grep 31103 KUBE-MARK-MASQ tcp -- 0.0.0.0/0 0.0.0.0/0 /* eks-dev-1-appname-ns/appnamed-backend-svc: */ tcp dpt:31103 KUBE-SVC-TII5GQPKXWC65SRC tcp -- 0.0.0.0/0 0.0.0.0/0 /* eks-dev-1-appname-ns/appname-backend-svc: */ tcp dpt:31103
[/simterm]
And here it is intercepting packets for the dpt:31103
(destination port 31103) and they are sent to the next chain – KUBE-SVC-TII5GQPKXWC65SRC
, check it now:
[simterm]
[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SVC-TII5GQPKXWC65SRC | column -t Chain KUBE-SVC-TII5GQPKXWC65SRC (2 references) target prot opt source destination KUBE-SEP-N36I6W2ULZA2XU52 all -- anywhere anywhere statistic mode random probability 0.50000000000 KUBE-SEP-4NFMB5GS6KDP7RHJ all -- anywhere anywhere
[/simterm]
Here we can see the next two chains where is the “routing magic” happens – the packet randomly will be sent to one of these chains, each has 0.5 from 1.0 “weight” – statistic mode random probability 0.5
, as per the official Kubernetes documentation:
By default, kube-proxy in iptables mode chooses a backend at random.
See also Turning IPTables into a TCP load balancer for fun and profit.
Check those chains.
The first one:
[simterm]
[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SEP-N36I6W2ULZA2XU52 -n | column -t Chain KUBE-SEP-N36I6W2ULZA2XU52 (1 references) target prot opt source destination KUBE-MARK-MASQ all -- 10.3.34.219 0.0.0.0/0 DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp to:10.3.34.219:3001
[/simterm]
And the second one:
[simterm]
[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SEP-4NFMB5GS6KDP7RHJ -n | column -t Chain KUBE-SEP-4NFMB5GS6KDP7RHJ (1 references) target prot opt source destination KUBE-MARK-MASQ all -- 10.3.57.124 0.0.0.0/0 DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp to:10.3.57.124:3001
[/simterm]
And here we can see that the DNAT
(Destination NAT) chain is sending the packet to an IP and 3001 port, which is actually our ContainerPort
– check the Deployment:
[simterm]
$ kk -n eks-dev-1-appname-ns get deploy appname-backend -o json | jq '.spec.template.spec.containers[].ports[].containerPort' 3001
[/simterm]
And go to see our pods IPs.
Find the pods:
[simterm]
$ kk -n eks-dev-1-appname-ns get pod NAME READY STATUS RESTARTS AGE appname-backend-768ddf9f54-2nrp5 1/1 Running 0 3d appname-backend-768ddf9f54-pm9bh 1/1 Running 0 3d
[/simterm]
And IP of the first pod:
[simterm]
$ kk -n eks-dev-1-appname-ns get pod appname-backend-768ddf9f54-2nrp5 --template={{.status.podIP}} 10.3.34.219
[/simterm]
And the second one:
[simterm]
$ kk -n eks-dev-1-appname-ns get pod appname-backend-768ddf9f54-pm9bh --template={{.status.podIP}} 10.3.57.124
[/simterm]
Isn’t it great? 🙂 So simple – and so great.
Now, let’s try to scale our Deployment to see how the KUBE-SVC-TII5GQPKXWC65SRC
will be changed to reflect the scaling.
Find the Deployment:
[simterm]
$ kk -n eks-dev-1-appname-ns get deploy NAME READY UP-TO-DATE AVAILABLE AGE appname-backend 2/2 2 2 64d
[/simterm]
Scale it from two to three pods:
[simterm]
$ kk -n eks-dev-1-appname-ns scale deploy appname-backend --replicas=3 deployment.extensions/appname-backend scaled
[/simterm]
Check the iptables
rules:
[simterm]
[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SVC-TII5GQPKXWC65SRC | column -t Chain KUBE-SVC-TII5GQPKXWC65SRC (2 references) target prot opt source destination KUBE-SEP-N36I6W2ULZA2XU52 all -- anywhere anywhere statistic mode random probability 0.33332999982 KUBE-SEP-HDIQCDRXRXBGRR55 all -- anywhere anywhere statistic mode random probability 0.50000000000 KUBE-SEP-4NFMB5GS6KDP7RHJ all -- anywhere anywhere
[/simterm]
Now we can see that our KUBE-SVC-TII5GQPKXWC65SRC
got three rules: the first one with the 0.33332999982 random, as there are two more rules after, then the second one with the 0.5 weight, and the last one – without rules at all.
Check the iptables statistics module.
Actually, “That’s all folks!” (c)
Useful links
- Kubernetes: ClusterIP vs NodePort vs LoadBalancer, Services и Ingress — обзор, примеры
- Kubernetes Networking Demystified: A Brief Guide
- A Deep Dive into Iptables and Netfilter Architecture
- Traversing of tables and chains
- Turning IPTables into a TCP load balancer for fun and profit
- Cracking kubernetes node proxy (aka kube-proxy)
- A minimal IPVS Load Balancer demo
- How to enable IPVS mode on AWS EKS?
- Load-Balancing in Kubernetes