Kubernetes: Service, load balancing, kube-proxy, and iptables

By | 11/01/2020
 

One day I wondered: how does load balancing between pods work in Kubernetes?

That is, we have an external Load Balancer, then a Service, and behind it, Pods.

What happens when a network packet arrives from the outside world and we have several pods: how is the traffic distributed between them?

kube-proxy

The routing rules between a Service and its Pods are controlled by the kube-proxy service, which can work in one of three modes: user space proxy mode, iptables proxy mode, and IPVS proxy mode.

User space proxy mode

A deprecated mode that was previously the default.

When using this mode, kube-proxy watches for changes in the cluster and, for each new Service, opens a TCP port on the WorkerNode.

Then iptables on this WorkerNode routes traffic from this port to the kube-proxy service, which acts as a proxy and uses the round-robin approach, i.e. it sends traffic to the next pod in its list of backends. If the first pod doesn’t respond, kube-proxy can try to send the packet to another pod.
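To make the round-robin-with-retry idea more tangible, here is a minimal sketch in Python. It is a toy model and not kube-proxy's actual code: the `UserspaceProxy` class and the `is_alive` callback are made up for illustration, and the callback stands in for a real connection attempt.

```python
class UserspaceProxy:
    """Toy model of a round-robin proxy that retries on another backend."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._next = 0  # index of the backend to try next

    def send(self, is_alive):
        """Try each backend at most once, starting at the round-robin cursor.

        is_alive is a hypothetical callable that stands in for a real
        connection attempt. Returns the backend that accepted the
        connection, or None if every backend failed.
        """
        for _ in range(len(self.backends)):
            backend = self.backends[self._next]
            self._next = (self._next + 1) % len(self.backends)
            if is_alive(backend):
                return backend
        return None


proxy = UserspaceProxy(["pod-a", "pod-b"])  # hypothetical backend names
dead = {"pod-a"}                            # pretend pod-a does not respond

# Every request ends up on pod-b, because pod-a is skipped after failing.
picks = [proxy.send(lambda b: b not in dead) for _ in range(3)]
print(picks)
```

The retry is exactly what the kernel-only modes below do not have.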

iptables proxy mode

This is our case, and the one we will investigate in this post. It is currently the default mode.

When using this mode, kube-proxy watches for changes in the cluster and, for each Service, installs a set of iptables rules on the WorkerNode instead of proxying the traffic itself.

These rules capture the traffic sent to a Kubernetes Service, which is actually a chain in the iptables rules, and via this chain the traffic goes directly to the pods that are backends for this Service. The target pod is selected randomly.

This mode is less expensive in terms of system resources, as all necessary operations are performed in the kernel by the netfilter module. It is also faster and more reliable, because there is no “middleware”, i.e. the kube-proxy itself, in the data path.

But if the first pod a packet was sent to does not respond, the connection fails, while in user space proxy mode kube-proxy would try to send it to another pod.

This is why it is so important to configure Readiness Probes properly, so that Kubernetes does not send traffic to pods that are not ready to accept it.
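A small Python sketch of why readiness matters here. It assumes a simplified model in which only pods reported as ready become Service backends, and the backend is then picked once, at random, with no retry. The function names and the pod dictionaries are hypothetical.

```python
import random


def ready_endpoints(pods):
    """Keep only the pods reported as ready; only these become backends."""
    return [p["ip"] for p in pods if p["ready"]]


def pick_backend(endpoints, rng=random):
    """iptables-style choice: one random pick, no second attempt."""
    if not endpoints:
        return None
    return rng.choice(endpoints)


pods = [
    {"ip": "10.0.0.1", "ready": True},   # hypothetical pod IPs
    {"ip": "10.0.0.2", "ready": False},  # failing its readiness probe
]

endpoints = ready_endpoints(pods)
print(endpoints)                # only the ready pod is left
print(pick_backend(endpoints))  # so it is the only possible choice
```

If the second pod were listed as a backend while unable to serve, roughly half of the connections would simply fail.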

Furthermore, this mode is harder to debug: in user space proxy mode kube-proxy writes its logs to /var/log/kube-proxy, while with netfilter you have to debug in the kernel itself.

IPVS proxy mode

The most recent mode. It uses the kernel’s netlink interface to create IPVS rules for new Kubernetes Services.

Its main advantage is the variety of load-balancing algorithms:

  • rr: round-robin
  • lc: least connection (smallest number of open connections)
  • dh: destination hashing
  • sh: source hashing
  • sed: shortest expected delay
  • nq: never queue
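To give a feeling for how some of these schedulers differ, here is a rough Python sketch of rr, lc, and sh. This is only an illustration of the selection logic, not how the kernel implements it: the backend names are hypothetical, and crc32 is just a stand-in for the kernel's own hash function.

```python
import zlib
from itertools import cycle


def make_round_robin(servers):
    """rr: hand out servers in a fixed rotating order."""
    rotation = cycle(servers)
    return lambda: next(rotation)


def least_connection(open_connections):
    """lc: pick the server with the fewest open connections."""
    return min(open_connections, key=open_connections.get)


def source_hash(servers, client_ip):
    """sh: the same client IP always maps to the same server.

    crc32 is only a stand-in here; the kernel uses its own hash function.
    """
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]


servers = ["pod-a", "pod-b", "pod-c"]  # hypothetical backend names

rr = make_round_robin(servers)
print([rr() for _ in range(4)])  # wraps around to pod-a after pod-c

print(least_connection({"pod-a": 3, "pod-b": 1, "pod-c": 2}))  # pod-b

# The same client is always sent to the same backend.
print(source_hash(servers, "203.0.113.7") == source_hash(servers, "203.0.113.7"))
```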

kube-proxy config

Let’s check which mode is used in our case, in an AWS Elastic Kubernetes Service cluster.

Find the kube-proxy pods:

[simterm]

$ kubectl -n kube-system get pod -l k8s-app=kube-proxy -o wide
NAME               READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
kube-proxy-4prtt   1/1     Running   1          158d   10.3.42.245   ip-10-3-42-245.us-east-2.compute.internal   <none>           <none>
kube-proxy-5b7pd   1/1     Running   0          60d    10.3.58.133   ip-10-3-58-133.us-east-2.compute.internal   <none>           <none>
kube-proxy-66cm5   1/1     Running   0          92d    10.3.58.193   ip-10-3-58-193.us-east-2.compute.internal   <none>           <none>
kube-proxy-8fdsv   1/1     Running   0          70d    10.3.39.145   ip-10-3-39-145.us-east-2.compute.internal   <none>           <none>
kube-proxy-8wbj2   1/1     Running   1          158d   10.3.49.200   ip-10-3-49-200.us-east-2.compute.internal   <none>           <none>
kube-proxy-cnd9c   1/1     Running   1          158d   10.3.47.58    ip-10-3-47-58.us-east-2.compute.internal    <none>           <none>
kube-proxy-cwppt   1/1     Running   0          158d   10.3.48.124   ip-10-3-48-124.us-east-2.compute.internal   <none>           <none>
kube-proxy-dd75p   1/1     Running   1          158d   10.3.43.168   ip-10-3-43-168.us-east-2.compute.internal   <none>           <none>
kube-proxy-p6hb7   1/1     Running   0          158d   10.3.46.137   ip-10-3-46-137.us-east-2.compute.internal   <none>           <none>
kube-proxy-pfjzt   1/1     Running   0          59d    10.3.62.200   ip-10-3-62-200.us-east-2.compute.internal   <none>           <none>
kube-proxy-spckd   1/1     Running   0          70d    10.3.44.14    ip-10-3-44-14.us-east-2.compute.internal    <none>           <none>
kube-proxy-tgl52   1/1     Running   0          59d    10.3.59.159   ip-10-3-59-159.us-east-2.compute.internal   <none>           <none>

[/simterm]

On every WorkerNode of the cluster, we have a dedicated kube-proxy instance with the kube-proxy-config ConfigMap attached:

[simterm]

$ kk -n kube-system get pod kube-proxy-4prtt -o yaml
apiVersion: v1
kind: Pod
...
spec:
...
  containers:
...
    volumeMounts:
...
    - mountPath: /var/lib/kube-proxy-config/
      name: config
      name: config
...
  volumes:
...
  - configMap:
      defaultMode: 420
      name: kube-proxy-config
    name: config

[/simterm]

Check this ConfigMap content:

[simterm]

$ kk -n kube-system get cm kube-proxy-config -o yaml
apiVersion: v1
data:
  config: |-
...
    mode: "iptables"
...

[/simterm]
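If you need to check this on many clusters, the same lookup can be scripted. Below is a minimal Python sketch that pulls the mode value out of the config text with a regular expression, so it needs no YAML library. The `proxy_mode` helper and the sample text are hypothetical; the fallback to iptables for an empty value is my assumption about the Linux default.

```python
import re


def proxy_mode(config_text, default="iptables"):
    """Return the value of the `mode:` key from a kube-proxy config text.

    An empty or missing value falls back to `default` (assumed here to be
    iptables, the default on Linux).
    """
    match = re.search(r'^\s*mode:\s*"?([A-Za-z]*)"?\s*$', config_text, re.MULTILINE)
    if match and match.group(1):
        return match.group(1)
    return default


# Hypothetical sample, shaped like the `config` key of the ConfigMap above.
sample = """\
kind: KubeProxyConfiguration
mode: "iptables"
"""

print(proxy_mode(sample))        # iptables
print(proxy_mode('mode: "ipvs"'))  # ipvs
print(proxy_mode('mode: ""'))      # falls back to the default
```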

Now that we are more familiar with the kube-proxy modes, let’s go deeper to see how it works and what iptables is doing here.

Kubernetes Pod load-balancing

For our journey, let’s take a real application with an Ingress (AWS Application Load Balancer, ALB) which sends traffic to a Kubernetes Service:

[simterm]

$ kk -n eks-dev-1-appname-ns get ingress appname-backend-ingress -o yaml
...
      - backend:
          serviceName: appname-backend-svc
          servicePort: 80
...

[/simterm]

Check the Service itself:

[simterm]

$ kk -n eks-dev-1-appname-ns get svc
NAME                              TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
appname-backend-svc   NodePort   172.20.249.22   <none>        80:31103/TCP   63d

[/simterm]

Here we have the NodePort type: the Service is listening on a TCP port of a WorkerNode.

The e172ad3e-eksdev1appname-abac ALB routes the traffic from clients to the e172ad3e-4caa286edf23ff7e06d AWS TargetGroup.

The EC2 instances in this TargetGroup are listening on port 31103, which we saw in the Service details above.

AWS LoadBalancer traffic modes

A side note: AWS ALB supports two traffic modes, IP mode and Instance mode.

  • Instance mode: the default mode, requires a Kubernetes Service to have the NodePort type, and routes traffic to a TCP port of a WorkerNode
  • IP mode: in this mode, the targets for an ALB are Kubernetes Pods directly instead of the Kubernetes WorkerNodes.
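The difference between the two modes is easiest to see in what ends up in the TargetGroup. Here is a small Python sketch of that, using node and pod addresses that appear elsewhere in this post; the `alb_targets` function is hypothetical and only models the resulting target list, not any AWS API.

```python
def alb_targets(mode, node_ips, pod_ips, node_port, container_port):
    """Return the (ip, port) pairs an ALB would send traffic to."""
    if mode == "instance":
        # Targets are WorkerNodes; traffic enters through the NodePort.
        return [(ip, node_port) for ip in node_ips]
    if mode == "ip":
        # Targets are the Pods themselves; the NodePort is not involved.
        return [(ip, container_port) for ip in pod_ips]
    raise ValueError(f"unknown ALB target mode: {mode}")


node_ips = ["10.3.49.200", "10.3.42.245"]  # two of the WorkerNodes
pod_ips = ["10.3.34.219", "10.3.57.124"]   # the application pods

print(alb_targets("instance", node_ips, pod_ips, 31103, 3001))
print(alb_targets("ip", node_ips, pod_ips, 31103, 3001))
```

In Instance mode, which is what we have here, the kube-proxy rules on the node still have to pick a pod; that is the part investigated in the rest of the post.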

Now we need access to one of these nodes: connect to a Bastion host, and then to one of the WorkerNodes:

[simterm]

ubuntu@ip-10-3-29-14:~$ ssh [email protected] -i .ssh/bttrm-eks-nodegroup-us-east-2.pem 
Last login: Thu May 28 06:25:27 2020 from ip-10-3-29-32.us-east-2.compute.internal

       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-2/
39 package(s) needed for security, out of 64 available
Run "sudo yum update" to apply all updates.
[ec2-user@ip-10-3-49-200 ~]$ sudo -s
[root@ip-10-3-49-200 ec2-user]#

[/simterm]

kube-proxy and iptables

So, a packet from a client came to the WorkerNode.

On this node, the kube-proxy service binds to the allocated port so that no other service can use it, and it also creates a set of iptables rules:

[simterm]

[root@ip-10-3-49-200 ec2-user]# netstat -anp | grep 31103
tcp6       0      0 :::31103                :::*                    LISTEN      4287/kube-proxy

[/simterm]
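The "hold the port so nobody else takes it" trick can be reproduced with a plain socket. The Python sketch below is only an illustration of the idea, not kube-proxy's code: it uses the loopback address and an ephemeral port picked by the OS instead of 31103, and both helper names are made up.

```python
import socket


def hold_port(port=0, host="127.0.0.1"):
    """Bind and listen on a port just to reserve it; nothing is ever accepted."""
    holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    holder.bind((host, port))  # port=0 lets the OS pick a free port
    holder.listen()
    return holder


def is_free(port, host="127.0.0.1"):
    """Return True if another process could still bind to this port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
            return True
        except OSError:
            return False


holder = hold_port()
port = holder.getsockname()[1]
held_while_open = not is_free(port)
print(port, held_while_open)  # the port is taken while the holder is open
holder.close()
```

The listening socket itself never sees the traffic: as shown next, the packets are redirected by the nat rules before they could reach it.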

The packet arrives at port 31103, where it starts being processed by the iptables rules.

iptables rules

The Linux kernel accepts this packet and sends it to the PREROUTING chain of the nat table.

See also the description of the kube-proxy iptables rules.

Check the rules in the nat table and its PREROUTING chain:

[simterm]

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L PREROUTING | column -t
Chain          PREROUTING  (policy  ACCEPT)                                                  
target         prot        opt      source    destination                                    
KUBE-SERVICES  all         --       anywhere  anywhere     /*  kubernetes  service  portals  */

[/simterm]

Here we have a jump to the next chain, KUBE-SERVICES, which has the KUBE-NODEPORTS chain as its last rule; it captures packets for Services with the NodePort type:

[simterm]

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SERVICES -n  | column -t
...
KUBE-NODEPORTS             all            --   0.0.0.0/0    0.0.0.0/0       /*  kubernetes  service  nodeports;  NOTE:  this  must  be  the  last  rule  in  this  chain  */  ADDRTYPE  match  dst-type  LOCAL
...

[/simterm]

Check rules in this chain:

[simterm]

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-NODEPORTS -n  | column -t | grep 31103
KUBE-MARK-MASQ             tcp             --   0.0.0.0/0    0.0.0.0/0    /*  eks-dev-1-appname-ns/appname-backend-svc:                    */  tcp  dpt:31103
KUBE-SVC-TII5GQPKXWC65SRC  tcp             --   0.0.0.0/0    0.0.0.0/0    /*  eks-dev-1-appname-ns/appname-backend-svc:                    */  tcp  dpt:31103

[/simterm]

Here, packets with dpt:31103 (destination port 31103) are intercepted and sent to the next chain, KUBE-SVC-TII5GQPKXWC65SRC. Check it now:

[simterm]

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SVC-TII5GQPKXWC65SRC | column -t 
Chain                      KUBE-SVC-TII5GQPKXWC65SRC  (2   references)                                                     
target                     prot                       opt  source       destination                                        
KUBE-SEP-N36I6W2ULZA2XU52  all                        --   anywhere     anywhere     statistic  mode  random  probability  0.50000000000
KUBE-SEP-4NFMB5GS6KDP7RHJ  all                        --   anywhere     anywhere

[/simterm]

Here we can see the next two chains, where the “routing magic” happens: the packet is randomly sent to one of them. The first rule matches with statistic mode random probability 0.5, and any packet it does not match falls through to the second rule, so each chain gets half of the traffic. As per the official Kubernetes documentation:

By default, kube-proxy in iptables mode chooses a backend at random.

See also Turning IPTables into a TCP load balancer for fun and profit.
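To convince myself that these two rules really give a 50/50 split, here is a quick simulation in Python. It models only the rule order and the probability from the output above: a rule with a probability matches when a random draw falls below it, and a rule without one always matches. The `traverse` function is a made-up helper, not anything from iptables itself.

```python
import random
from collections import Counter

# The two rules of KUBE-SVC-TII5GQPKXWC65SRC, in order.
# None means "no statistic match", i.e. the rule always matches.
CHAIN = [
    ("KUBE-SEP-N36I6W2ULZA2XU52", 0.5),
    ("KUBE-SEP-4NFMB5GS6KDP7RHJ", None),
]


def traverse(chain, rng):
    """Walk the rules top to bottom and return the first one that matches."""
    for target, probability in chain:
        if probability is None or rng.random() < probability:
            return target
    return None


rng = random.Random(42)  # fixed seed to keep the run repeatable
hits = Counter(traverse(CHAIN, rng) for _ in range(10_000))
for target, count in hits.items():
    print(target, count)  # both counts land close to 5000
```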

Check those chains.

The first one:

[simterm]

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SEP-N36I6W2ULZA2XU52  -n | column -t
Chain           KUBE-SEP-N36I6W2ULZA2XU52  (1   references)                    
target          prot                       opt  source       destination       
KUBE-MARK-MASQ  all                        --   10.3.34.219  0.0.0.0/0         
DNAT            tcp                        --   0.0.0.0/0    0.0.0.0/0    tcp  to:10.3.34.219:3001

[/simterm]

And the second one:

[simterm]

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SEP-4NFMB5GS6KDP7RHJ  -n | column -t
Chain           KUBE-SEP-4NFMB5GS6KDP7RHJ  (1   references)                    
target          prot                       opt  source       destination       
KUBE-MARK-MASQ  all                        --   10.3.57.124  0.0.0.0/0         
DNAT            tcp                        --   0.0.0.0/0    0.0.0.0/0    tcp  to:10.3.57.124:3001

[/simterm]

And here we can see that the DNAT (Destination NAT) rule sends the packet to the pod’s IP and port 3001, which is actually our ContainerPort. Check the Deployment:

[simterm]

$ kk -n eks-dev-1-appname-ns get deploy appname-backend -o json | jq '.spec.template.spec.containers[].ports[].containerPort'
3001

[/simterm]

Now let’s check our pods’ IPs.

Find the pods:

[simterm]

$ kk -n eks-dev-1-appname-ns get pod
NAME                                                         READY   STATUS      RESTARTS   AGE
appname-backend-768ddf9f54-2nrp5                 1/1     Running     0          3d
appname-backend-768ddf9f54-pm9bh                 1/1     Running     0          3d

[/simterm]

The IP of the first pod:

[simterm]

$ kk -n eks-dev-1-appname-ns get pod appname-backend-768ddf9f54-2nrp5 --template={{.status.podIP}}
10.3.34.219

[/simterm]

And the second one:

[simterm]

$ kk -n eks-dev-1-appname-ns get pod appname-backend-768ddf9f54-pm9bh --template={{.status.podIP}}
10.3.57.124

[/simterm]
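The whole path we have just walked through by hand, from PREROUTING down to DNAT, can be modeled in a few lines of Python. This is a simplified sketch built from the outputs above: the KUBE-MARK-MASQ rules and all the other Services are left out, and the `NAT` dictionary and the `route` function are my own illustration rather than anything iptables provides.

```python
import random

# Chains of the nat table as seen above (KUBE-MARK-MASQ rules omitted).
NAT = {
    "PREROUTING": [{"jump": "KUBE-SERVICES"}],
    "KUBE-SERVICES": [{"jump": "KUBE-NODEPORTS"}],
    "KUBE-NODEPORTS": [{"dport": 31103, "jump": "KUBE-SVC-TII5GQPKXWC65SRC"}],
    "KUBE-SVC-TII5GQPKXWC65SRC": [
        {"probability": 0.5, "jump": "KUBE-SEP-N36I6W2ULZA2XU52"},
        {"jump": "KUBE-SEP-4NFMB5GS6KDP7RHJ"},
    ],
    "KUBE-SEP-N36I6W2ULZA2XU52": [{"dnat": ("10.3.34.219", 3001)}],
    "KUBE-SEP-4NFMB5GS6KDP7RHJ": [{"dnat": ("10.3.57.124", 3001)}],
}


def route(dport, rng, chain="PREROUTING", path=None):
    """Follow the chains for a packet to `dport`.

    Returns ((pod_ip, pod_port), visited_chains), or None if nothing matched.
    """
    path = [] if path is None else path
    path.append(chain)
    for rule in NAT[chain]:
        if "dport" in rule and rule["dport"] != dport:
            continue  # rule is for another destination port
        if "probability" in rule and rng.random() >= rule["probability"]:
            continue  # the random match did not fire, try the next rule
        if "dnat" in rule:
            return rule["dnat"], path
        result = route(dport, rng, rule["jump"], path)
        if result is not None:
            return result
    return None


destination, visited = route(31103, random.Random(1))
print(" -> ".join(visited))
print("DNAT to", destination)
```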

Isn’t it great? 🙂 So simple – and so great.

Now, let’s try to scale our Deployment to see how the KUBE-SVC-TII5GQPKXWC65SRC chain changes to reflect the scaling.

Find the Deployment:

[simterm]

$ kk -n eks-dev-1-appname-ns get deploy
NAME                          READY   UP-TO-DATE   AVAILABLE   AGE
appname-backend   2/2     2            2           64d

[/simterm]

Scale it from two to three pods:

[simterm]

$ kk -n eks-dev-1-appname-ns scale deploy appname-backend --replicas=3
deployment.extensions/appname-backend scaled

[/simterm]

Check the iptables rules:

[simterm]

[root@ip-10-3-49-200 ec2-user]# iptables -t nat -L KUBE-SVC-TII5GQPKXWC65SRC | column -t 
Chain                      KUBE-SVC-TII5GQPKXWC65SRC  (2   references)                                                     
target                     prot                       opt  source       destination                                        
KUBE-SEP-N36I6W2ULZA2XU52  all                        --   anywhere     anywhere     statistic  mode  random  probability  0.33332999982
KUBE-SEP-HDIQCDRXRXBGRR55  all                        --   anywhere     anywhere     statistic  mode  random  probability  0.50000000000
KUBE-SEP-4NFMB5GS6KDP7RHJ  all                        --   anywhere     anywhere

[/simterm]

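Now we can see that our KUBE-SVC-TII5GQPKXWC65SRC chain has three rules: the first one with a random probability of 0.33332999982 (roughly 1/3), as there are two more rules after it; then the second one with a probability of 0.5, which applies only to the 2/3 of packets that reach it; and the last one with no condition at all, which takes whatever is left. As a result, each pod gets about a third of the traffic.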
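The arithmetic behind these numbers can be checked with exact fractions. The Python sketch below assumes that rule number i out of n (counting from zero) gets the probability 1/(n - i), which is consistent with the 1/3, 1/2, and unconditional rules in the output above; both function names are made up for this example.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Probability on each of n rules, top to bottom: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_shares(probabilities):
    """Share of all packets that each rule actually catches."""
    shares = []
    remaining = Fraction(1)  # fraction of packets that reached this rule
    for p in probabilities:
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


probs = rule_probabilities(3)
print([str(p) for p in probs])                    # ['1/3', '1/2', '1']
print([str(s) for s in effective_shares(probs)])  # ['1/3', '1/3', '1/3']
```

So the uneven-looking per-rule probabilities are exactly what is needed for an even split across the pods.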

See also the documentation of the iptables statistic module.

Actually, “That’s all folks!” (c)

Useful links