Kubernetes: Pods and WorkerNodes – controlling the placement of Pods on Nodes

08/19/2023
 

Kubernetes allows very flexible control over how Pods are placed on servers, i.e. WorkerNodes.

This can be useful if you need to run a Pod on a specific node configuration – for example, a WorkerNode must have a GPU, or an SSD instead of an HDD. Another example is when you need to place individual Pods next to each other to reduce their communication latency, or to reduce cross-Availability Zone traffic (see AWS: Grafana Loki, InterZone traffic in AWS, and Kubernetes nodeAffinity).

And, of course, this is important for building a High Availability and Fault Tolerance architecture, when you need to spread Pods across individual Nodes or Availability Zones.

We have four main approaches to control how Kubernetes Pods are hosted on WorkerNodes:

  • configure Nodes in such a way that they will accept only individual Pods that meet the criteria specified on the node
    • taints and tolerations: on the Node we set the taint, for which Pods must have the appropriate toleration to run on this node
  • configure the Pod itself in such a way that it will select only individual Nodes that meet the criteria specified in the Pod
    • for this, we can use nodeName – only a Node with the specified name is selected
    • or nodeSelector to select Nodes with corresponding labels and their values
    • or nodeAffinity and nodeAntiAffinity – the rules by which Kubernetes Scheduler will choose a Node to launch the Pod depending on the parameters of this Node
  • configure the Pod itself so that it will select a Node based on how other Pods are running
    • for this, we use podAffinity and podAntiAffinity – the rules by which Kubernetes Scheduler will choose a Node to launch the Pod depending on the other Pods on this Node
  • and a separate topic – Pod Topology Spread Constraints, i.e. the rules for placing Pods by failure domains – regions, Availability zones, or nodes

kubectl explain

Just a tip: you can always read the relevant documentation for any parameter or resource using kubectl explain:

[simterm]

$ kubectl explain pod
KIND:       Pod
VERSION:    v1

DESCRIPTION:
    Pod is a collection of containers that can run on a host. This resource is
    created by clients and scheduled onto hosts.
...

[/simterm]

Or:

[simterm]

$ kubectl explain Pod.spec.nodeName
KIND:       Pod
VERSION:    v1

FIELD: nodeName <string>

DESCRIPTION:
    NodeName is a request to schedule this pod onto a specific node. If it is
    non-empty, the scheduler simply schedules this pod onto that node, assuming
    that it fits resource requirements.

[/simterm]

Node Taints and Pods Tolerations

So, the first option is to set restrictions on the Node itself as to which Pods can run on it, using Taints and Tolerations.

Here a taint “repels” Pods that do not have a corresponding toleration, while a toleration allows (but does not require) a Pod to run on a Node that has the matching taint.

For example, we can create a Node on which only Pods with some critical services such as controllers will be launched.

To do so, specify a taint with the effect: NoSchedule – that is, prohibit scheduling of new Pods on this Node:

[simterm]

$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoSchedule
node/ip-10-0-3-133.ec2.internal tainted

[/simterm]

Next, create a Pod with a toleration with the key "critical-addons":

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  tolerations:
    - key: "critical-addons"
      operator: "Exists"
      effect: "NoSchedule"

Deploy, and check Pods on that Node:

[simterm]

$ kubectl get pod --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-3-133.ec2.internal
NAMESPACE           NAME                                                              READY   STATUS    RESTARTS   AGE     IP           NODE                         NOMINATED NODE   READINESS GATES
default             my-pod                                                            1/1     Running   0          2m11s   10.0.3.39    ip-10-0-3-133.ec2.internal   <none>           <none>
dev-monitoring-ns   atlas-victoriametrics-loki-logs-zxd9m                             2/2     Running   0          10m     10.0.3.8     ip-10-0-3-133.ec2.internal   <none>           <none>
...

[/simterm]

But where does Loki come from? Loki’s Pod was already running on this Node when the taint was set, and the NoSchedule effect does not evict Pods that are already running.

To prevent this, add a taint with the NoExecute effect – then the scheduler will evict already running Pods from this Node unless they tolerate the taint:

[simterm]

$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoExecute

[/simterm]

Check taints now:

[simterm]

$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.spec.taints'
[
  {
    "effect": "NoExecute",
    "key": "critical-addons",
    "value": "true"
  },
  {
    "effect": "NoSchedule",
    "key": "critical-addons",
    "value": "true"
  }
]

[/simterm]

Add a second toleration for our Pod, otherwise it will be evicted from this Node too:

...
  tolerations:
    - key: "critical-addons"
      operator: "Exists"
      effect: "NoSchedule"
    - key: "critical-addons"
      operator: "Exists"
      effect: "NoExecute"

Deploy and check Pods on this Node again:

[simterm]

$ kubectl get pod --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-3-133.ec2.internal
NAMESPACE     NAME                                               READY   STATUS    RESTARTS   AGE   IP           NODE                         NOMINATED NODE   READINESS GATES
default       my-pod                                             1/1     Running   0          3s    10.0.3.246   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   aws-node-jrsjz                                     1/1     Running   0          16m   10.0.3.133   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   csi-secrets-store-secrets-store-csi-driver-cctbj   3/3     Running   0          16m   10.0.3.144   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   ebs-csi-node-46fts                                 3/3     Running   0          16m   10.0.3.187   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   kube-proxy-6ztqs                                   1/1     Running   0          16m   10.0.3.133   ip-10-0-3-133.ec2.internal   <none>           <none>

[/simterm]

Now, on this Node, we have only our Pod and Pods from DaemonSets, which by default should run on all Nodes and have the corresponding tolerations – see How Daemon Pods are scheduled.
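A NoExecute toleration can also be limited in time with the tolerationSeconds field – the Pod is then allowed to stay on the tainted Node only for the specified period before being evicted. A minimal sketch (the 3600 value here is just an example):

```yaml
...
  tolerations:
    - key: "critical-addons"
      operator: "Exists"
      effect: "NoExecute"
      # the Pod may keep running on the tainted Node for up to 1 hour
      # after the taint is added, then it will be evicted
      tolerationSeconds: 3600
```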

In addition to the Exists operator, which only checks for the presence of the specified taint key, it is possible to check the taint’s value.

To do so, use the Equal operator, and add the required value:

...
  tolerations:
    - key: "critical-addons"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
    - key: "critical-addons"
      operator: "Equal"
      value: "true"
      effect: "NoExecute"

To delete a taint, add a minus at the end:

[simterm]

$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoSchedule-
node/ip-10-0-3-133.ec2.internal untainted
$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoExecute-
node/ip-10-0-3-133.ec2.internal untainted

[/simterm]

Choosing a Node by a Pod: nodeName, nodeSelector, and nodeAffinity

Another approach is when we configure a Pod in such a way that “it” chooses which Node to run on.

For this we have nodeName, nodeSelector, nodeAffinity, and nodeAntiAffinity. See Assign Pods to Nodes.

nodeName

The most straightforward way: the Pod is bound to the Node by name, bypassing the scheduler, so it takes precedence over all other methods:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeName: ip-10-0-3-133.ec2.internal

nodeSelector

With the nodeSelector we can choose Nodes that have the corresponding labels.

Add a label to the Node:

[simterm]

$ kubectl label nodes ip-10-0-3-133.ec2.internal service=monitoring
node/ip-10-0-3-133.ec2.internal labeled

[/simterm]

Check it:

[simterm]

$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.metadata.labels'
{
  ...
  "kubernetes.io/hostname": "ip-10-0-3-133.ec2.internal",
  "kubernetes.io/os": "linux",
  "node.kubernetes.io/instance-type": "t3.medium",
  "service": "monitoring",
  ...

[/simterm]

In the Pod’s manifest set the nodeSelector:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeSelector:
    service: monitoring

If several labels are assigned in the Pod’s nodeSelector, then the corresponding Node must have all these labels in order for this Pod to run on it.
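As a sketch, a nodeSelector with two labels – here the disktype=ssd label is a hypothetical example in addition to the service=monitoring label set above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeSelector:
    # the Node must have BOTH labels for the Pod to be scheduled on it
    service: monitoring
    disktype: ssd
```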

nodeAffinity and nodeAntiAffinity

nodeAffinity and nodeAntiAffinity operate in the same way as the nodeSelector, but have more flexible capabilities.

For example, you can set hard or soft scheduling limits: with a soft limit, the scheduler will try to launch a Pod on a matching Node, and if it cannot, it will launch it on another one. With a hard limit, if the scheduler cannot start the Pod on a matching Node, the Pod will remain in the Pending status.

The hard limit is set in the .spec.affinity.nodeAffinity field with requiredDuringSchedulingIgnoredDuringExecution, and the soft limit with preferredDuringSchedulingIgnoredDuringExecution.

For example, we can launch a Pod in the us-east-1a or us-east-1b Availability Zone using the topology.kubernetes.io/zone node label:

[simterm]

$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.metadata.labels'
{
  ...
  "topology.kubernetes.io/region": "us-east-1",
  "topology.kubernetes.io/zone": "us-east-1b"
}

[/simterm]

Set a hard-limit:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b

Or a soft limit. For example, with a non-existent label:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: non-exist-node-label
            operator: In
            values:
            - non-exist-value

In this case, the Pod will still be launched on any available Node, since the preference cannot be satisfied.

You can also combine conditions:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: non-exist-node-label
            operator: In
            values:
            - non-exist-value

When using several conditions in the requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms, it is enough for a Node to match any one of them.

When using several conditions in the matchExpressions field they all must match.

In the operator field you can use the operators In, NotIn, Exists, DoesNotExist, Gt (greater than), and Lt (less than).
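For example, Gt and Lt compare the label value as an integer, which can be useful with custom numeric node labels. A sketch assuming a hypothetical cpu-cores label set on the Nodes:

```yaml
...
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          # schedule only on Nodes whose cpu-cores label value is greater than 4
          - key: cpu-cores
            operator: Gt
            values:
            - "4"
```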

The soft limit and the weight

In the preferredDuringSchedulingIgnoredDuringExecution you can set a weight for a condition with a value from 1 to 100.

In this case, if all other conditions are equal, the scheduler will prefer a Node matching the condition with the largest weight:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
      - weight: 100
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1b

This Pod will be launched on a Node in the us-east-1b zone:

[simterm]

$ kubectl get pod my-pod -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP           NODE                         NOMINATED NODE   READINESS GATES
my-pod   1/1     Running   0          3s    10.0.3.245   ip-10-0-3-133.ec2.internal   <none>           <none>

[/simterm]

And the zone of this Node:

[simterm]

$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1b

[/simterm]

podAffinity and podAntiAffinity

Similar to selecting a Node with hard and soft limits, you can adjust a Pod’s affinity depending on the labels of Pods already running on a Node. See Inter-pod affinity and anti-affinity.

For example, Grafana Loki has three kinds of Pods – Read, Write, and Backend.

We want to run the Read and Backend in the same AvailabilityZone to avoid cross-AZ traffic, but at the same time, we want them not to run on those Nodes where there are Write Pods.

Loki Pods have labels corresponding to a component – app.kubernetes.io/component=read, app.kubernetes.io/component=backend, and app.kubernetes.io/component=write.

So, for the Read Pod, we can set a podAffinity to Pods with the label app.kubernetes.io/component=backend, and a podAntiAffinity to Pods with the label app.kubernetes.io/component=write:

...
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                - backend
            topologyKey: "topology.kubernetes.io/zone"
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app.kubernetes.io/component
                operator: In
                values:
                - write
            topologyKey: "kubernetes.io/hostname"
...

Here in the podAffinity.topologyKey we set that we want to place Pods within the topology.kubernetes.io/zone domain – that is, the topology.kubernetes.io/zone label of the Node running a Read Pod must match the one of the Backend Pods.

And in the podAntiAffinity.topologyKey we set kubernetes.io/hostname – that is, do not place a Read Pod on a WorkerNode that already runs Pods with the label app.kubernetes.io/component=write.

Let’s deploy and check where there is a Write Pod:

[simterm]

$ kubectl -n dev-monitoring-ns get pod loki-write-0 -o json | jq '.spec.nodeName'
"ip-10-0-3-53.ec2.internal"

[/simterm]

And AvailabilityZone of this Node:

[simterm]

$ kubectl -n dev-monitoring-ns get node ip-10-0-3-53.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1b

[/simterm]

Check where the Backend Pod is placed:

[simterm]

$ kubectl -n dev-monitoring-ns get pod loki-backend-0 -o json | jq '.spec.nodeName'
"ip-10-0-2-220.ec2.internal"

[/simterm]

And its zone:

[simterm]

$ kubectl -n dev-monitoring-ns get node ip-10-0-2-220.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1a

[/simterm]

And now, a Read Pod:

[simterm]

$ kubectl -n dev-monitoring-ns get pod loki-read-698567cdb-wxgj5 -o json | jq '.spec.nodeName'
"ip-10-0-2-173.ec2.internal"

[/simterm]

The Node is different from the Write or Backend Nodes, but:

[simterm]

$ kubectl -n dev-monitoring-ns get node ip-10-0-2-173.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1a

[/simterm]

The same Availability Zone as the Backend Pod.

Pod Topology Spread Constraints

We can configure Kubernetes Scheduler in such a way that it distributes Pods by “domains”, that is, by nodes, regions, or Availability Zones. See Pod Topology Spread Constraints.

For this, we can set the necessary config in the spec.topologySpreadConstraints field, which describes exactly how Pods must be distributed.

For example, we have 5 WorkerNodes in two AvailabilityZones.

We want to run 5 Pods and for fault tolerance we want each Pod to be on a separate Node.

Then our config for a Deployment can look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-container
          image: nginx:latest
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app

Here:

  • maxSkew: the maximum allowed difference in the number of Pods between domains (topologyKey)
    • it is strictly enforced only with whenUnsatisfiable=DoNotSchedule; with whenUnsatisfiable=ScheduleAnyway a Pod will be created regardless of the conditions
  • whenUnsatisfiable: can have the value DoNotSchedule – do not allow a Pod to be created if the constraint cannot be satisfied, or ScheduleAnyway
  • topologyKey: the WorkerNode label by which a domain is selected, that is, the label by which we group the Nodes when calculating the placement of Pods
  • labelSelector: which Pods to take into account when placing new Pods (for example, if Pods belong to different Deployments but should be placed in the same way, then in both Deployments we configure topologySpreadConstraints with a common labelSelector)

In addition, you can set the nodeAffinityPolicy and/or nodeTaintsPolicy parameters with the Honor or Ignore values to configure whether a Pod’s nodeAffinity or the Node’s taints must be taken into account when calculating the placement of a Pod.
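For example, to spread the same Pods evenly across Availability Zones instead of individual Nodes, change the topologyKey and add the policies – a sketch, assuming the same my-app Deployment as above:

```yaml
...
      topologySpreadConstraints:
        - maxSkew: 1
          # group Nodes by Availability Zone instead of by hostname
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          # include all Nodes in the calculation regardless of the Pod's nodeAffinity
          nodeAffinityPolicy: Ignore
          # exclude Nodes whose taints the Pod does not tolerate
          nodeTaintsPolicy: Honor
          labelSelector:
            matchLabels:
              app: my-app
```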

Let’s deploy and check the Nodes of these Pods:

[simterm]

$ kubectl get pod -o json | jq '.items[].spec.nodeName'
"ip-10-0-3-53.ec2.internal"
"ip-10-0-3-22.ec2.internal"
"ip-10-0-2-220.ec2.internal"
"ip-10-0-2-173.ec2.internal"
"ip-10-0-3-133.ec2.internal"

[/simterm]

All are placed on separate Nodes.