How exactly do resources.requests and resources.limits in a Kubernetes manifest work “under the hood”, and how exactly does Linux allocate and limit resources for containers?
So, in Kubernetes for Pods we can set two main parameters for CPU and Memory – the spec.containers.resources.requests and spec.containers.resources.limits fields:
- resources.requests: affects how and where a Pod will be scheduled, and how many resources it is guaranteed to receive
- resources.limits: affects how many resources it can consume at most:
  - if a container uses more memory than resources.limits.memory, it can be killed by the OOM Killer (and Pods can also be evicted if the WorkerNode runs out of free memory – the Node Memory Pressure state)
  - if a container tries to use more CPU than resources.limits.cpu, CPU throttling is enabled
While Memory is fairly simple – we just set a number of bytes – with CPU things are a bit more interesting.
So first, let's take a look at how the Linux kernel decides how much CPU time to allocate to each process, using the Control Groups mechanism.
Linux cgroups
Linux Control Groups (cgroups) is one of the two main kernel mechanisms that provide isolation and control over processes:
- Linux namespaces: create an isolated namespace with its own process tree (PID namespace), network interfaces (net namespace), User IDs (User namespace), and so on – see What is: Linux namespaces, examples of PID and Network namespaces (in Russian)
- Linux cgroups: a mechanism for controlling the resources available to processes – how much memory, CPU, network bandwidth, and disk I/O a process can use
It is called Groups because all processes are organized into a parent-child tree.
Therefore, if a limit of 512 megabytes is set for a parent group, the total memory used by it and all its children cannot exceed 512 MB.
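A minimal sketch of this hierarchical behavior – assuming a cgroups v2 host, root access, and hypothetical group names demo-parent and demo-child:

# create a parent group with a 512 MB memory limit and a child group inside it
$ sudo mkdir /sys/fs/cgroup/demo-parent
$ echo $((512*1024*1024)) | sudo tee /sys/fs/cgroup/demo-parent/memory.max
$ sudo mkdir /sys/fs/cgroup/demo-parent/demo-child

# the child has no limit of its own...
$ cat /sys/fs/cgroup/demo-parent/demo-child/memory.max
max

# ...but the parent's 512 MB limit applies to the whole subtree,
# so the parent and the child together can never use more than 512 MB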
All groups are defined in the /sys/fs/cgroup/ directory, which is mounted as a separate filesystem type – cgroup2:
$ mount | grep cgro
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)
cgroups has an older version 1 and a newer version 2, see man cgroups.
In fact, cgroups v2 is already the new standard, so that is what we will focus on – but cgroups v1 still comes up when we talk about Kubernetes.
You can check the version using stat on the /sys/fs/cgroup/ directory:
$ stat -fc %T /sys/fs/cgroup/
cgroup2fs
If there is tmpfs here, then this is cgroups v1.
The /sys/fs/cgroup/ directory
A typical view of this directory on a Linux host – here’s an example from my home laptop running Arch Linux:
$ tree /sys/fs/cgroup/ -d -L 2
/sys/fs/cgroup/
├── dev-hugepages.mount
├── dev-mqueue.mount
├── init.scope
...
├── system.slice
│   ├── NetworkManager.service
│   ├── bluetooth.service
│   ├── bolt.service
...
└── user.slice
    └── user-1000.slice
The same hierarchy can be seen with systemctl status or systemd-cgls.
In systemctl status, the tree looks like this:
$ systemctl status
● setevoy-work
    State: running
    ...
    Since: Mon 2025-06-09 12:21:11 EEST; 3 weeks 1 day ago
  systemd: 257.6-1-arch
   CGroup: /
           ├─init.scope
           │ └─1 /sbin/init
           ├─system.slice
           │ ├─NetworkManager.service
           │ │ └─858 /usr/bin/NetworkManager --no-daemon
           ...
           │ └─wpa_supplicant.service
           │   └─1989 /usr/bin/wpa_supplicant -u -s -O /run/wpa_supplicant
           └─user.slice
Here, all processes are grouped by type:
- system.slice: all systemd services (nginx.service, docker.service, etc.)
- user.slice: user processes
- machine.slice: virtual machines and containers
Here, a slice is a systemd abstraction used to group processes – see man systemd.slice.
You can see which cgroup a process belongs to in its /proc/<PID>/cgroup, for example, NetworkManager with PID “858”:
$ cat /proc/858/cgroup
0::/system.slice/NetworkManager.service
The cgroup slice can also be specified in a service's systemd unit file:
$ cat /usr/lib/systemd/system/[email protected] | grep Slice
Slice=system.slice
CPU and Memory in cgroups, and cgroups v1 vs cgroups v2
So, in the cgroup for the entire slice, we set how much CPU and Memory the processes of this group can use (from here on, we will only talk about CPU and Memory).
For example, for my user setevoy (with ID 1000), we have the files cpu.max and memory.max:
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.max
max 100000
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/memory.max
max
cpu.max in cgroups v2 replaced cpu.cfs_quota_us and cpu.cfs_period_us from cgroups v1.
Here, in cpu.max, we have the settings for how much CPU time my user's processes can get.
The format of the file is <quota> <period>, where <quota> is the CPU time (in microseconds) available to the group within each period, and <period> is the length of one period, also in microseconds (100,000 µs = 100 ms).
In cgroups v1, these values were set in cpu.cfs_quota_us (the <quota> in v2) and cpu.cfs_period_us (the <period> in v2).
That is, in the file above we see:
- max: all available CPU time can be used (no quota)
- 100000: one CPU period of 100,000 µs = 100 ms
The CPU period here is the time window over which the Linux kernel accounts for how much CPU the processes of a cgroup have used: if the group has a quota and its processes have exhausted it, they will be suspended until the end of the current period (CPU throttling).
That is, if a quota of 50,000 µs (50 ms) is set for a group with a period of 100,000 µs (100 ms), then its processes can use only 50 ms of CPU time in each 100 ms “window”.
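A minimal sketch of exactly that – a hypothetical demo group with a 50 ms quota, on a cgroups v2 host with root access:

# allow only 50 ms of CPU time per 100 ms period
$ sudo mkdir /sys/fs/cgroup/demo
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/demo/cpu.max

# move a CPU-hungry process into the group
$ yes > /dev/null &
$ echo $! | sudo tee /sys/fs/cgroup/demo/cgroup.procs

# in top/htop the yes process will now sit at roughly 50% of one core,
# because it is throttled for the second half of every 100 ms period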
Memory usage can be seen in the memory.current file:
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/memory.current
47336714240
Which gives us:
$ echo "47336714240 / 1024 / 1024" | bc
45143
About 45 gigabytes of memory used by the processes of user 1000.
You can also check the current resource usage of each group with systemd-cgtop, or narrow it down by passing a slice name, for example systemd-cgtop user.slice.
For CPU, there are also cumulative statistics for the group since its processes were created – the cpu.stat file:
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.stat
usage_usec 2863938974603
...
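On kernels with the CPU controller enabled, cpu.stat also contains the throttling counters – nr_periods, nr_throttled, and throttled_usec – which are handy for checking whether a group is actually hitting its quota:

$ grep -E 'nr_periods|nr_throttled|throttled_usec' /sys/fs/cgroup/user.slice/user-1000.slice/cpu.stat
nr_periods 0
nr_throttled 0
throttled_usec 0

Here everything is zero, because this slice has no quota (cpu.max is “max”), so nothing has ever been throttled.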
In Kubernetes, cpu.max and memory.max are set based on resources.limits.cpu and resources.limits.memory.
Why Kubernetes CPU Limits may be a bad idea
It is often said that setting CPU limits in Kubernetes is a bad idea.
Why is this so?
Because if we set a limit (i.e. the value in cpu.max is not max), then when a group of processes uses up its time in the current CPU period, those processes will be throttled even if the CPU could still serve them.
That is, even if there are free cores in the system, but the cgroup has already exhausted its cpu.max in the current period, the processes of this group will be suspended until the end of the period (CPU throttling), regardless of the overall system load.
See For the Love of God, Stop Using CPU Limits on Kubernetes, and Making Sense of Kubernetes CPU Requests And Limits.
Linux CFS and cpu.weight
Above we saw cpu.max, where my user is allowed to use all available CPU time in each CPU period.
But if the limit is not set (i.e. max), and several groups of processes want access to the CPU at the same time, then the kernel must decide who should be allocated more CPU time.
To do this, another parameter is set in cgroups – cpu.weight (in cgroups v2) or cpu.shares (in cgroups v1): the relative priority of a group of processes when they compete for CPU time.
The value of cpu.weight is taken into account by the Linux CFS (Completely Fair Scheduler) to allocate CPU proportionally among several cgroups – see CFS Scheduler and Process Scheduling in Linux.
$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.weight
100
The range of values here is from 1 to 10,000, where 1 is the minimum priority and 10,000 is the maximum. The value 100 is the default.
The higher the priority, the more time CFS will allocate to processes in this group.
But this is only taken into account when there is a race for CPU time: when the processor is free, all processes get as much CPU time as they need.
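A minimal sketch of this proportional sharing – hypothetical groups hi and lo on a cgroups v2 host, with both processes pinned to the same core to force contention:

# two groups with a 3:1 weight ratio
$ sudo mkdir /sys/fs/cgroup/hi /sys/fs/cgroup/lo
$ echo 300 | sudo tee /sys/fs/cgroup/hi/cpu.weight
$ echo 100 | sudo tee /sys/fs/cgroup/lo/cpu.weight

# one CPU-bound process in each group, both pinned to CPU 0
$ taskset -c 0 yes > /dev/null & echo $! | sudo tee /sys/fs/cgroup/hi/cgroup.procs
$ taskset -c 0 yes > /dev/null & echo $! | sudo tee /sys/fs/cgroup/lo/cgroup.procs

# while they compete for the same core, CFS splits CPU time between them roughly 3:1;
# if there were enough idle CPU for both, each would simply get everything it asks for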
In Kubernetes, cpu.weight is derived from resources.requests.cpu.
The value of resources.requests.memory, however, only affects the Kubernetes Scheduler, which uses it to pick a WorkerNode with enough free memory.
cpu.weight vs process nice
In addition to cpu.weight/cpu.shares, we also have the process nice value, which sets the priority of a task.
The difference between them is that cpu.weight is set at the cgroup level, while nice is set at the level of a specific process within the same group.
And while a higher cpu.weight value means a higher priority, with nice it is the opposite – the lower the nice value (the range is from -20 to 19), the more CPU time the process gets.
If both processes are in the same cgroup but with different nice values, then nice is taken into account.
And if they are in different cgroups, then cpu.weight is taken into account.
That is, cpu.weight determines which group of processes is more important to the kernel, and nice determines which process within the group has priority.
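A small sketch of the nice part – two CPU-bound processes started from the same shell (and therefore in the same cgroup), pinned to one core with different nice values:

# same cgroup, same core - only nice decides who gets more CPU time
$ taskset -c 0 nice -n 0 yes > /dev/null &
$ taskset -c 0 nice -n 10 yes > /dev/null &

# in top, the nice 0 process gets noticeably more CPU than the nice 10 one;
# if they were in different cgroups, cpu.weight would decide instead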
Linux cgroups summary
So, each Control Group determines how much CPU and memory will be allocated to a process.
- cpu.max: determines how much time from each CPU period a process group can spend
  - the Kubernetes manifest values in resources.limits.cpu and resources.limits.memory affect the cpu.max and memory.max settings for the cgroup of the corresponding containers
- memory.max: how much memory can be used without the risk of being killed by the Out of Memory Killer
  - the Kubernetes manifest value of resources.requests.memory affects only the Kubernetes Scheduler when selecting a Kubernetes WorkerNode
- cpu.weight: determines the priority of a group of processes when the CPU is under load
  - the Kubernetes manifest value of resources.requests.cpu affects the cpu.weight setting for the cgroup of the corresponding containers
Kubernetes Pod resources and Linux cgroups
Okay, now that we’ve figured out cgroups on Linux, let’s take a closer look at how the values in Kubernetes resources.requests and resources.limits affect containers.
When we set spec.containers.resources in a Deployment or Pod and the Pod is created on a WorkerNode, the kubelet on that node takes the values from the PodSpec and passes them through the Container Runtime Interface (CRI) to the container runtime (containerd or CRI-O).
The runtime converts them into a container specification in JSON, which contains the corresponding values for this container's cgroup.
Kubernetes CPU Unit vs cgroup CPU share
In Kubernetes manifests, we specify CPU resources in CPU units: 1 unit == 1 full CPU core – physical or virtual, see CPU resource units.
1 millicpu (or millicore) is 1/1000 of one CPU core.
One Kubernetes CPU Unit is 1024 CPU shares in the corresponding Linux cgroup.
That is: 1 Kubernetes CPU Unit == 1000 millicpu == 1024 CPU shares in a cgroup.
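So the conversion from Kubernetes CPU units to cgroups v1 shares is simply shares = millicpu * 1024 / 1000 (with a minimum of 2 shares) – a quick check in the shell:

# 1000m (one full CPU) and 500m converted to cgroups v1 CPU shares
$ echo $(( 1000 * 1024 / 1000 ))
1024
$ echo $(( 500 * 1024 / 1000 ))
512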
In addition, there is a nuance in how Kubernetes calculates cpu.weight for Pods: Kubernetes still operates with CPU shares, which are then translated into cpu.weight for cgroups v2 – we will see how this looks below.
Checking Kubernetes Pod resources in cgroup
Let’s create a test Pod:
apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
  namespace: default
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "1"
        memory: "1Gi"
      limits:
        cpu: "1"
        memory: "1Gi"
Run it and find the appropriate WorkerNode:
$ kk describe pod nginx-test
Name:             nginx-test
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-10-0-32-142.ec2.internal/10.0.32.142
...
Let’s connect via SSH and take a look at the cgroups settings.
Kubernetes kubepods.slice cgroup
All parameters for Kubernetes Pods are set in the /sys/fs/cgroup/kubepods.slice/ directory:
[root@ip-10-0-32-142 ec2-user]# ls -l /sys/fs/cgroup/kubepods.slice/
...
drwxr-xr-x. 5 root root 0 Jul 2 12:30 kubepods-besteffort.slice
drwxr-xr-x. 6 root root 0 Jul 2 12:30 kubepods-burstable.slice
drwxr-xr-x. 4 root root 0 Jul 2 12:31 kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice
...
To find out which cgroup slice is responsible for our container, let's check the running containers in the k8s.io namespace:
[root@ip-10-0-32-142 ec2-user]# ctr -n k8s.io containers ls
CONTAINER                                                           IMAGE                                                                               RUNTIME
00d432ee10181ce579af7f0d02a3a04167ced45f8438167f3922e385ed9ab58f    602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/eks-pod-identity-agent:v0.1.29    io.containerd.runc.v2
...
987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f    docker.io/library/nginx:latest                                                      io.containerd.runc.v2
...
Note: the namespaces in ctr are containerd namespaces, not Linux ones – see containerd namespaces for Docker, Kubernetes, and beyond.
Our container is “987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f”.
Let's check all the information about it:
[root@ip-10-0-32-142 ec2-user]# ctr -n k8s.io containers info 987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f
...
    "linux": {
        "resources": {
            "devices": [
                {
                    "allow": false,
                    "access": "rwm"
                }
            ],
            "memory": {
                "limit": 1073741824,
                "swap": 1073741824
            },
            "cpu": {
                "shares": 1024,
                "quota": 100000,
                "period": 100000
            },
            "unified": {
                "memory.oom.group": "1",
                "memory.swap.max": "0"
            }
        },
        "cgroupsPath": "kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice:cri-containerd:987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f",
...
Here we see resources.memory and resources.cpu.
Everything is clear with memory, and in resources.cpu we have three fields:
- shares: these are our requests from the Pod manifest (PodSpec)
- quota: these are our limits
- period: the CPU period mentioned above – the “accounting window” for CFS
In cgroupsPath we see which cgroup slice contains the information about this container:
[root@ip-10-0-32-142 ec2-user]# ls -l /sys/fs/cgroup/kubepods.slice/kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice/cri-containerd-987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f.scope/
...
-rw-r--r--. 1 root root 0 Jul 2 12:31 cpu.idle
-rw-r--r--. 1 root root 0 Jul 2 12:31 cpu.max
...
-rw-r--r--. 1 root root 0 Jul 2 12:31 cpu.weight
...
-rw-r--r--. 1 root root 0 Jul 2 12:31 memory.max
...
And the corresponding values in them:
[root@ip-10-0-32-142 ec2-user]# cat /sys/fs/cgroup/kubepods.slice/kubepods-pod[...]cca2f.scope/cpu.max
100000 100000
That is, a maximum of 100,000 microseconds out of each 100,000-microsecond window – because we set resources.limits.cpu == “1”, i.e. one full core.
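A fractional limit maps to a smaller quota in the same way: quota = limits.cpu * period. For example, a hypothetical limits.cpu: "250m" would give a quota of 25 ms per 100 ms period:

# quota in microseconds = millicpu * period / 1000
$ echo $(( 250 * 100000 / 1000 ))
25000
# so cpu.max for such a container would contain "25000 100000"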
Kubernetes, cpu.weight and cgroups v2
But if we take a look at the cpu.weight file, the picture is as follows:
[root@ip-10-0-32-142 ec2-user]# cat /sys/fs/cgroup/kubepods.slice/kubepods-pod[...]cca2f.scope/cpu.weight
39
Where did the value “39” come from?
In the container description, we saw shares == 1024:
...
    "linux": {
        "resources": {
            ...
            "cpu": {
                "shares": 1024,
...
cpu.shares 1024 is the value we set in Kubernetes when we specified resources.requests.cpu == “1”, because, as mentioned above, “One Kubernetes CPU Unit is 1024 CPU shares”.
That is, with cgroups v1 we would see a value of 1024 in the cpu.shares file.
But cgroups v2 is a bit more interesting: under the hood, Kubernetes still counts CPU shares in the 1 core == 1024 shares format, which are then translated into the cgroups v2 format.
If we look at the total cpu.weight for the entire kubepods.slice, we will see a value of 76:
[root@ip-10-0-32-142 ec2-user]# cat /sys/fs/cgroup/kubepods.slice/cpu.weight
76
Where did the “76” come from?
Let’s check the number of cores on this instance:
[root@ip-10-0-32-142 ec2-user]# lscpu | grep -E '^CPU\('
CPU(s):              2
The formula for calculating cpu.weight is described in the kubelet's cgroup_manager_linux.go#L566:
...
func CpuSharesToCpuWeight(cpuShares uint64) uint64 {
	return uint64((((cpuShares - 2) * 9999) / 262142) + 1)
}
...
Having 2 cores == 2048 CPU shares for v1 – we calculate:
((((2048 - 2) * 9999) / 262142) + 1)
79
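The same quick check in the shell – for the whole node (2048 shares) and for our container's 1024 shares, which is exactly where the “39” in the container's cpu.weight came from:

# cpu.weight for 2048 shares (2 full cores)
$ echo $(( (2048 - 2) * 9999 / 262142 + 1 ))
79
# cpu.weight for 1024 shares (our requests.cpu == "1")
$ echo $(( (1024 - 2) * 9999 / 262142 + 1 ))
39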
That is, if all 2048 shares went to kubepods.slice, it would get a cpu.weight of 79.
But that calculation counts all of the node's CPU shares – in reality, part of the CPU is reserved for the system and for components like the kubelet, which is why the actual value is slightly lower: 76.
Kubernetes Quality of Service Classes
See Kubernetes: Evicted pods and Pods Quality of Service, and Pod Quality of Service Classes.
In the /sys/fs/cgroup/kubepods.slice/ directory we have:
- kubepods-besteffort.slice: BestEffort QoS – when neither requests nor limits are set
- kubepods-burstable.slice: Burstable QoS – when only requests are set, or requests and limits are set but not equal
- kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice: Guaranteed QoS – when requests and limits are set and equal to each other (such Pods get their cgroup directly under kubepods.slice)
Our Pod has exactly the Guaranteed QoS class:
$ kk describe pod nginx-test | grep QoS
QoS Class:                   Guaranteed
Because we set requests == limits.
And since we set 1 full core out of the node's 2 cores in requests, Kubernetes through cgroups allocates our container roughly half of the total cpu.weight available to kubepods.slice: the container got a weight of 39, while the whole kubepods.slice has 38*2 == 76 – exactly the value we saw in kubepods.slice/cpu.weight.
That's why, when there is competition for the CPU, Linux CFS will give our container about half of the available CPU time on both cores – that is, “one whole core”.
Useful links
- How CPU Weight Is Calculated on the VictoriaMetrics blog
- Making Sense of Kubernetes CPU Requests And Limits
- Cgroups – Deep Dive into Resource Management in Kubernetes
- Resource Management for Pods and Containers
- cgroups (Arch Wiki)
- CPU and Memory Management on Kubernetes with Cgroupsv2