Kubernetes allows very flexible control over how its Pods are placed on servers, i.e. WorkerNodes.
This can be useful if you need to run a pod on a specific node configuration, for example – a WorkerNode must have a GPU, or an SSD instead of an HDD. Another example is when you need to place individual Pods next to each other to reduce their communication latency, or to reduce cross Availability-zone traffic (see AWS: Grafana Loki, InterZone traffic in AWS, and Kubernetes nodeAffinity).
And, of course, this is important for building a High Availability and Fault Tolerance architecture, when you need to divide pods into individual nodes or Availability Zones.
We have four main approaches to control how Kubernetes Pods are hosted on WorkerNodes:
- configure Nodes in such a way that they will accept only individual Pods that meet the criteria specified on the node:
  - taints and tolerations: on the Node we set a taint, for which Pods must have the appropriate toleration to run on this node
- configure the Pod itself in such a way that it will select only individual Nodes that meet the criteria specified in the Pod; for this, we can use:
  - nodeName – only a Node with the specified name is selected
  - nodeSelector – to select Nodes with corresponding labels and their values
  - nodeAffinity and nodeAntiAffinity – the rules by which the Kubernetes Scheduler will choose a Node to launch the Pod depending on the parameters of this Node
- configure the Pod itself so that it will select a Node based on how other Pods are running; for this, we use:
  - podAffinity and podAntiAffinity – the rules by which the Kubernetes Scheduler will choose a Node to launch the Pod depending on the other Pods on this Node
- and a separate topic – Pod Topology Spread Constraints, i.e. the rules for placing Pods by failure domains – regions, Availability Zones, or Nodes
kubectl explain
Just a tip: you can always read the relevant documentation for any parameter or resource using kubectl explain:
[simterm]
$ kubectl explain pod
KIND:     Pod
VERSION:  v1
DESCRIPTION:
     Pod is a collection of containers that can run on a host. This resource is
     created by clients and scheduled onto hosts.
...
[/simterm]
Or:
[simterm]
$ kubectl explain Pod.spec.nodeName
KIND:     Pod
VERSION:  v1
FIELD:    nodeName <string>
DESCRIPTION:
     NodeName is a request to schedule this pod onto a specific node. If it is
     non-empty, the scheduler simply schedules this pod onto that node, assuming
     that it fits resource requirements.
[/simterm]
Node Taints and Pod Tolerations
So, the first option is to set restrictions on the Node itself that determine which Pods can run on it, using Taints and Tolerations.
Here, a taint “repels” Pods that do not have a corresponding toleration from that Node, while a toleration allows (but does not force) a Pod to run on a Node that has the corresponding taint.
For example, we can create a Node on which only Pods with some critical services such as controllers will be launched.
To do so, set a taint with the effect NoSchedule – that is, prohibit scheduling new Pods on this Node:
[simterm]
$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoSchedule
node/ip-10-0-3-133.ec2.internal tainted
[/simterm]
Next, create a Pod with a toleration with the key "critical-addons":
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  tolerations:
  - key: "critical-addons"
    operator: "Exists"
    effect: "NoSchedule"
Deploy, and check Pods on that Node:
[simterm]
$ kubectl get pod --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-3-133.ec2.internal
NAMESPACE           NAME                                    READY   STATUS    RESTARTS   AGE     IP          NODE                         NOMINATED NODE   READINESS GATES
default             my-pod                                  1/1     Running   0          2m11s   10.0.3.39   ip-10-0-3-133.ec2.internal   <none>           <none>
dev-monitoring-ns   atlas-victoriametrics-loki-logs-zxd9m   2/2     Running   0          10m     10.0.3.8    ip-10-0-3-133.ec2.internal   <none>           <none>
...
[/simterm]
But where does the Loki Pod come from? The Scheduler had managed to place it on this Node before the Taint was set, and the NoSchedule effect does not evict Pods that are already running.
To prevent this, add a Taint with the NoExecute effect – then already running Pods that do not have the corresponding toleration will be evicted from this Node and moved to other Nodes:
[simterm]
$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoExecute
[/simterm]
Check the taints now:
[simterm]
$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.spec.taints'
[
  {
    "effect": "NoExecute",
    "key": "critical-addons",
    "value": "true"
  },
  {
    "effect": "NoSchedule",
    "key": "critical-addons",
    "value": "true"
  }
]
[/simterm]
Add the second toleration to our Pod, otherwise it will be evicted from this Node too:
...
  tolerations:
  - key: "critical-addons"
    operator: "Exists"
    effect: "NoSchedule"
  - key: "critical-addons"
    operator: "Exists"
    effect: "NoExecute"
Deploy and check Pods on this Node again:
[simterm]
$ kubectl get pod --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-3-133.ec2.internal
NAMESPACE     NAME                                               READY   STATUS    RESTARTS   AGE   IP           NODE                         NOMINATED NODE   READINESS GATES
default       my-pod                                             1/1     Running   0          3s    10.0.3.246   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   aws-node-jrsjz                                     1/1     Running   0          16m   10.0.3.133   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   csi-secrets-store-secrets-store-csi-driver-cctbj   3/3     Running   0          16m   10.0.3.144   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   ebs-csi-node-46fts                                 3/3     Running   0          16m   10.0.3.187   ip-10-0-3-133.ec2.internal   <none>           <none>
kube-system   kube-proxy-6ztqs                                   1/1     Running   0          16m   10.0.3.133   ip-10-0-3-133.ec2.internal   <none>           <none>
[/simterm]
Now, on this Node we have only our Pod and Pods from DaemonSets, which by default should run on all Nodes and therefore have the corresponding tolerations – see How Daemon Pods are scheduled.
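As a side note, a toleration with an empty key and operator: Exists matches any taint, which is how such "run everywhere" DaemonSets usually achieve this. A minimal sketch, assuming a hypothetical example-agent DaemonSet (the name and image are for illustration only, not the actual manifests of the EKS add-ons above):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: example-agent
spec:
  selector:
    matchLabels:
      app: example-agent
  template:
    metadata:
      labels:
        app: example-agent
    spec:
      containers:
      - name: agent
        image: busybox:latest
        command: ["sleep", "infinity"]
      tolerations:
      # an empty key with operator Exists matches every taint,
      # so these Pods can run on any Node regardless of its taints
      - operator: "Exists"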
In addition to the Exists operator, which only checks for the presence of the specified taint key, it is possible to check the value of this key. To do so, use Equal in the operator field, and add the required value:
...
  tolerations:
  - key: "critical-addons"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  - key: "critical-addons"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
To delete a taint, add a minus sign at the end:
[simterm]
$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoSchedule-
node/ip-10-0-3-133.ec2.internal untainted
$ kubectl taint nodes ip-10-0-3-133.ec2.internal critical-addons=true:NoExecute-
node/ip-10-0-3-133.ec2.internal untainted
[/simterm]
Choosing a Node by a Pod: nodeName, nodeSelector, and nodeAffinity
Another approach is when we configure a Pod in such a way that “it” chooses which Node to run on.
For this, we have nodeName, nodeSelector, nodeAffinity, and nodeAntiAffinity. See Assign Pods to Nodes.
nodeName
The most straightforward way, and it has precedence over all the other methods – the Pod is bound directly to the named Node, bypassing the Scheduler:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeName: ip-10-0-3-133.ec2.internal
nodeSelector
With the nodeSelector, we can choose Nodes that have corresponding labels.
Add a label to the Node:
[simterm]
$ kubectl label nodes ip-10-0-3-133.ec2.internal service=monitoring
node/ip-10-0-3-133.ec2.internal labeled
[/simterm]
Check it:
[simterm]
$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.metadata.labels'
{
  ...
  "kubernetes.io/hostname": "ip-10-0-3-133.ec2.internal",
  "kubernetes.io/os": "linux",
  "node.kubernetes.io/instance-type": "t3.medium",
  "service": "monitoring",
  ...
[/simterm]
In the Pod’s manifest, set the nodeSelector:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeSelector:
    service: monitoring
If several labels are assigned in the Pod’s nodeSelector, then the corresponding Node must have all of these labels in order for this Pod to run on it.
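For example, a short sketch with two labels – the service=monitoring label we added above and the standard kubernetes.io/os label – the Pod will be scheduled only on a Node that carries both:
...
spec:
  containers:
  - name: my-container
    image: nginx:latest
  nodeSelector:
    # both labels must be present on the Node for the Pod to be scheduled on it
    service: monitoring
    kubernetes.io/os: linux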
nodeAffinity and nodeAntiAffinity
nodeAffinity and nodeAntiAffinity operate in the same way as the nodeSelector, but have more flexible capabilities.
For example, you can set hard or soft launch limits – for a soft limit, the scheduler will try to launch a Pod on the corresponding Node, and if it cannot, it will launch it on another. Accordingly, if you set a hard limit and the scheduler cannot start the Pod on the selected Node, the Pod will remain in Pending status.
The hard limit is set in the .spec.affinity.nodeAffinity field with the requiredDuringSchedulingIgnoredDuringExecution, and the soft limit – with the preferredDuringSchedulingIgnoredDuringExecution.
For example, we can launch a Pod in the AvailabilityZone us-east-1a or us-east-1b using the Node label topology.kubernetes.io/zone:
[simterm]
$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq '.metadata.labels'
{
  ...
  "topology.kubernetes.io/region": "us-east-1",
  "topology.kubernetes.io/zone": "us-east-1b"
}
[/simterm]
Set a hard-limit:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
Or a soft limit. For example, with a non-existent label:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: non-exist-node-label
            operator: In
            values:
            - non-exist-value
In this case, the Pod will still be launched – the scheduler simply picks whichever suitable Node is available.
You can also combine conditions:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
            - us-east-1b
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: non-exist-node-label
            operator: In
            values:
            - non-exist-value
When several entries are used in the requiredDuringSchedulingIgnoredDuringExecution.nodeSelectorTerms, they are OR’ed – it is enough for a Node to match any one of them.
When several conditions are used within one matchExpressions field, they are AND’ed – the Node must match all of them.
In the operator field, you can use the operators In, NotIn, Exists, DoesNotExist, Gt (greater than), and Lt (less than).
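To make these rules concrete, here is a sketch combining them – the example-cpu-count label is hypothetical and used only for illustration, and Gt/Lt compare the label value as an integer:
...
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        # the two terms below are OR'ed – a Node has to match only one of them
        - matchExpressions:
          # conditions within one matchExpressions list are AND'ed
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
          - key: service
            operator: Exists
        - matchExpressions:
          # hypothetical label, its value is compared as an integer
          - key: example-cpu-count
            operator: Gt
            values:
            - "4"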
soft-limit and the weight
In the preferredDuringSchedulingIgnoredDuringExecution, you can set a weight for a condition with a value from 1 to 100.
In this case, if all other conditions are equal, the scheduler will prefer the Node whose matching conditions have the greatest weight:
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: nginx:latest
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1a
      - weight: 100
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - us-east-1b
This Pod will be launched on a Node in the us-east-1b zone:
[simterm]
$ kubectl get pod my-pod -o wide
NAME     READY   STATUS    RESTARTS   AGE   IP           NODE                         NOMINATED NODE   READINESS GATES
my-pod   1/1     Running   0          3s    10.0.3.245   ip-10-0-3-133.ec2.internal   <none>           <none>
[/simterm]
And the zone of this Node:
[simterm]
$ kubectl get node ip-10-0-3-133.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1b
[/simterm]
podAffinity and podAntiAffinity
Similarly to selecting a Node with hard and soft limits, you can configure Pod Affinity depending on the labels of the Pods already running on a Node. See Inter-pod affinity and anti-affinity.
For example, Grafana Loki has three kinds of Pods – Read, Write, and Backend.
We want to run the Read and Backend in the same AvailabilityZone to avoid cross-AZ traffic, but at the same time, we want them not to run on those Nodes where there are Write Pods.
Loki Pods have labels corresponding to a component – app.kubernetes.io/component=read, app.kubernetes.io/component=backend, and app.kubernetes.io/component=write.
So, for the Read Pod, we can set a podAffinity to Pods with the label app.kubernetes.io/component=backend, and a podAntiAffinity to Pods with the label app.kubernetes.io/component=write:
...
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - backend
        topologyKey: "topology.kubernetes.io/zone"
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values:
            - write
        topologyKey: "kubernetes.io/hostname"
...
Here, in the podAffinity.topologyKey, we set that we want to place Pods by the topology.kubernetes.io/zone domain – that is, the topology.kubernetes.io/zone of the Read Pods must match that of the Backend Pods.
And in the podAntiAffinity.topologyKey, we set kubernetes.io/hostname – that is, do not place the Pod on WorkerNodes where there are Pods with the label app.kubernetes.io/component=write.
Let’s deploy and check where there is a Write Pod:
[simterm]
$ kubectl -n dev-monitoring-ns get pod loki-write-0 -o json | jq '.spec.nodeName'
"ip-10-0-3-53.ec2.internal"
[/simterm]
And the AvailabilityZone of this Node:
[simterm]
$ kubectl -n dev-monitoring-ns get node ip-10-0-3-53.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1b
[/simterm]
Check where the Backend Pod is placed:
[simterm]
$ kubectl -n dev-monitoring-ns get pod loki-backend-0 -o json | jq '.spec.nodeName'
"ip-10-0-2-220.ec2.internal"
[/simterm]
And its zone:
[simterm]
$ kubectl -n dev-monitoring-ns get node ip-10-0-2-220.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1a
[/simterm]
And now, a Read Pod:
[simterm]
$ kubectl -n dev-monitoring-ns get pod loki-read-698567cdb-wxgj5 -o json | jq '.spec.nodeName'
"ip-10-0-2-173.ec2.internal"
[/simterm]
The Node is different from the Write or Backend Nodes, but:
[simterm]
$ kubectl -n dev-monitoring-ns get node ip-10-0-2-173.ec2.internal -o json | jq -r '.metadata.labels."topology.kubernetes.io/zone"'
us-east-1a
[/simterm]
The same AvailabilityZone as the Backend Pod.
Pod Topology Spread Constraints
We can configure Kubernetes Scheduler in such a way that it distributes Pods by “domains”, that is, by nodes, regions, or Availability Zones. See Pod Topology Spread Constraints.
For this, we can set the necessary config in the spec.topologySpreadConstraints field, which describes exactly how Pods should be distributed.
For example, we have 5 WorkerNodes in two AvailabilityZones.
We want to run 5 Pods and for fault tolerance we want each Pod to be on a separate Node.
Then our config for a Deployment can look like this:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-container
        image: nginx:latest
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: my-app
Here:
- maxSkew: the maximum allowed difference in the number of Pods between domains (topologyKey); it plays a role only if whenUnsatisfiable=DoNotSchedule – with whenUnsatisfiable=ScheduleAnyway a Pod will be created regardless of the conditions
- whenUnsatisfiable: can have the value DoNotSchedule – do not allow a Pod to be created if the constraint cannot be satisfied – or ScheduleAnyway (see the sketch right after this list)
- topologyKey: the WorkerNode label by which the domain will be selected, that is, the label by which we group the Nodes across which the placement of Pods is calculated
- labelSelector: which Pods to take into account when placing new Pods (for example, if Pods are from different Deployments but should be placed in the same way, then in both Deployments we configure topologySpreadConstraints with mutual labelSelectors)
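For example, a minimal sketch of a softer constraint – spread the same my-app Pods evenly across Availability Zones, but still schedule them even if the zones cannot be balanced:
...
      topologySpreadConstraints:
      # group Nodes by Availability Zone instead of by individual Node
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        # ScheduleAnyway turns the constraint into a preference:
        # a Pod is still scheduled even if the zones end up unbalanced
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: my-app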
In addition, you can set the nodeAffinityPolicy and/or nodeTaintsPolicy parameters with the Honor or Ignore values to configure whether a Pod’s nodeAffinity/nodeSelector and the Nodes’ taints must be taken into account when calculating the placement of Pods – see the sketch at the end of this section.
Let’s deploy and check the Nodes of these Pods:
[simterm]
$ kk get pod -o json | jq '.items[].spec.nodeName'
"ip-10-0-3-53.ec2.internal"
"ip-10-0-3-22.ec2.internal"
"ip-10-0-2-220.ec2.internal"
"ip-10-0-2-173.ec2.internal"
"ip-10-0-3-133.ec2.internal"
[/simterm]
All are placed on separate Nodes.
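And, as mentioned above, the nodeAffinityPolicy and nodeTaintsPolicy fields fit into the same topologySpreadConstraints block. A sketch with the values spelled out explicitly – these happen to be the defaults (Honor for nodeAffinityPolicy, Ignore for nodeTaintsPolicy), and the fields require a reasonably recent Kubernetes version:
...
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: DoNotSchedule
        # Honor: Nodes filtered out by the Pod's nodeAffinity/nodeSelector
        # are excluded from the skew calculation
        nodeAffinityPolicy: Honor
        # Ignore: Node taints are not considered – all Nodes are included
        # in the skew calculation
        nodeTaintsPolicy: Ignore
        labelSelector:
          matchLabels:
            app: my-app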