Kubernetes: Pod resources.requests, resources.limits, and Linux cgroups

07/20/2025

How exactly do resources.requests and resources.limits in a Kubernetes manifest work “under the hood”, and how exactly does Linux allocate and limit resources for containers?

So, in Kubernetes for Pods, we can set two main parameters for CPU and Memory – the spec.containers.resources.requests and spec.containers.resources.limits fields:

  • resources.requests: affects how and where a Pod will be scheduled and how many resources it is guaranteed to receive
  • resources.limits: sets the maximum amount of resources it can consume:
    • if memory usage goes above resources.limits.memory – the Pod can be killed by the OOM Killer, or evicted if the WorkerNode does not have enough free memory (the Node Memory Pressure state)
    • if CPU usage goes above resources.limits.cpu – CPU throttling kicks in

Memory is fairly straightforward – we just set a number of bytes – but CPU is a bit more interesting.

So first, let’s take a look at how the Linux kernel decides how much CPU time each process gets, using the Control Groups mechanism.

Linux cgroups

Linux Control Groups (cgroups) is one of the two main kernel mechanisms that provide isolation and control over processes:

  • Linux namespaces: create an isolated namespace with its own process tree (PID Namespace), network interfaces (net namespace), User IDs (User namespace), and so on – see What is: Linux namespaces, examples of PID and Network namespaces (in rus)
  • Linux cgroups: a mechanism for limiting the resources of processes – how much memory, CPU, network resources, and disk I/O will be available to a process

The “groups” in the name refers to the fact that all processes are organized into a parent-child tree.

Therefore, if a limit of 512 megabytes is set for a parent process, then the sum of the available memory of it and its children cannot exceed 512 MB.
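
A minimal sketch of such a hierarchy (assuming root and cgroups v2; the demo-mem and demo-mem/child group names are made up for illustration):

$ sudo mkdir /sys/fs/cgroup/demo-mem
# enable the memory controller for child groups
$ echo "+memory" | sudo tee /sys/fs/cgroup/demo-mem/cgroup.subtree_control
$ sudo mkdir /sys/fs/cgroup/demo-mem/child
# limit the parent group to 512 megabytes
$ echo "512M" | sudo tee /sys/fs/cgroup/demo-mem/memory.max
# the child has no limit of its own, but is still capped by the parent's 512M
$ echo "max" | sudo tee /sys/fs/cgroup/demo-mem/child/memory.max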

All groups are defined in the /sys/fs/cgroup/ directory, which is mounted as a separate filesystem type – cgroup2:

$ mount | grep cgro
cgroup2 on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot)

cgroups has an older version 1 and a newer version 2, see man cgroups.

In fact, cgroups v2 is already the new standard, so we will focus on it – but cgroups v1 still comes up when we talk about Kubernetes.

You can check the version using stat and the /sys/fs/cgroup/ directory:

$ stat -fc %T /sys/fs/cgroup/ 
cgroup2fs

If there is tmpfs here, then this is cgroups v1.

The /sys/fs/cgroup/ directory

A typical view of a directory on a Linux host – here’s an example from my home laptop running Arch Linux:

$ tree /sys/fs/cgroup/ -d -L 2
/sys/fs/cgroup/
├── dev-hugepages.mount
├── dev-mqueue.mount
├── init.scope
...
├── system.slice
│   ├── NetworkManager.service
│   ├── bluetooth.service
│   ├── bolt.service
    ...
└── user.slice
    └── user-1000.slice

The same hierarchy can be seen with systemctl status or systemd-cgls.

In systemctl status, the tree looks like this:

$ systemctl status
● setevoy-work
    State: running
    ...
    Since: Mon 2025-06-09 12:21:11 EEST; 3 weeks 1 day ago
  systemd: 257.6-1-arch
   CGroup: /
           ├─init.scope
           │ └─1 /sbin/init
           ├─system.slice
           │ ├─NetworkManager.service
           │ │ └─858 /usr/bin/NetworkManager --no-daemon
           ...
           │ └─wpa_supplicant.service
           │   └─1989 /usr/bin/wpa_supplicant -u -s -O /run/wpa_supplicant
           └─user.slice

Here, all processes are grouped by type:

  • system.slice: all systemd services (nginx.service, docker.service, etc.)
  • user.slice: user processes
  • machine.slice: virtual machines, containers

Here, a slice is a systemd abstraction used to group processes – see man systemd.slice.

You can see which cgroup a process belongs to in its /proc/<PID>/cgroup, for example, NetworkManager with PID “858”:

$ cat /proc/858/cgroup 
0::/system.slice/NetworkManager.service

The cgroup slice can also be specified in the systemd file of the service:

$ cat /usr/lib/systemd/system/[email protected] | grep Slice 
Slice=system.slice
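
In the same spirit, systemd can set cgroup limits for a unit directly through resource-control properties – a quick sketch with a transient unit (the demo-limits name and the sleep command are just placeholders, see man systemd.resource-control):

$ sudo systemd-run --unit=demo-limits -p MemoryMax=512M -p CPUQuota=50% sleep 300
Running as unit: demo-limits.service
# systemd translates the properties into the unit's cgroup files
$ cat /sys/fs/cgroup/system.slice/demo-limits.service/cpu.max
50000 100000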

CPU and Memory in cgroups, and cgroups v1 vs cgroups v2

So, in the cgroup for the entire slice, you set how much CPU and Memory the processes of this group can use (from here on, we will talk only about CPU and Memory).

For example, for my user setevoy (with ID 1000), we have the files cpu.max and memory.max:

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.max
max 100000

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/memory.max 
max

cpu.max in cgroups v2 replaced cpu.cfs_quota_us and cpu.cfs_period_us from cgroup v1

Here, in cpu.max, we have the settings for how much CPU time will be devoted to my user’s processes.

The format of the file is <quota> <period>, where <quota> is the time available to the process (or group), and <period> is the duration of one period in microseconds (100,000 µs = 100 ms).

In cgroups v1, these values were set in cpu.cfs_quota_us (the <quota> in v2) and cpu.cfs_period_us (the <period> in v2).

That is, in the file above we see:

  • max: no quota – the group can use all available CPU time
  • 100000 µs = 100 ms, one CPU period

The CPU period here is the time interval during which the Linux kernel checks how much CPU time the processes in the cgroup have used: if the group has a quota and the processes have exhausted it, they will be suspended until the end of the current period (CPU throttling).

That is, if a quota of 50,000 (50 ms) is set for a group with a period of 100,000 microseconds (100 ms), then its processes can use only 50 ms of CPU time in each 100 ms “window”.
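
To make this concrete, a minimal sketch of setting exactly that quota on a throwaway cgroup (assuming root, cgroups v2, and that the cpu controller is enabled for the top-level hierarchy; the demo-cpu group name is made up):

$ sudo mkdir /sys/fs/cgroup/demo-cpu
# 50 ms of CPU time per 100 ms period
$ echo "50000 100000" | sudo tee /sys/fs/cgroup/demo-cpu/cpu.max
# move the current shell into the group – from now on it will be throttled to half a core
$ echo $$ | sudo tee /sys/fs/cgroup/demo-cpu/cgroup.procs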

Memory usage can be seen in the file memory.current:

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/memory.current 
47336714240

Which gives us:

$ echo "47336714240 / 1024 / 1024" | bc 
45143

That is, about 45 gigabytes of memory used by the processes of user 1000.

You can also check the current resource usage of each group with systemd-cgtop – either for the whole system or for a specific slice:
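
$ systemd-cgtop
$ systemd-cgtop user.slice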

For CPU, cumulative statistics for the group since it was created are available in cpu.stat:

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.stat 
usage_usec 2863938974603 
...
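
For example, the usage_usec value above converts to roughly the total hours of CPU time consumed by the group:

$ echo "2863938974603 / 1000000 / 3600" | bc
795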

In Kubernetes, cpu.max and memory.max will be determined when we set resources.limits.cpu and resources.limits.memory.

Why Kubernetes CPU Limits may be a bad idea

It is often said that setting CPU limits in Kubernetes is a bad idea.

Why is this so?

Because if we set a limit (i.e., a value other than max in cpu.max), then once a group of processes uses up its quota in the current CPU time window, those processes will be throttled even if the CPU still has spare capacity.

That is, even if there are free cores in the system, but the cgroup has already exhausted its cpu.max in the current period, the processes of this group will be suspended until the end of the period (CPU throttling), regardless of the overall system load.

See For the Love of God, Stop Using CPU Limits on Kubernetes, and Making Sense of Kubernetes CPU Requests And Limits.

Linux CFS and cpu.weight

Above we saw cpu.max, where my user is allowed to use all available CPU time for each CPU period.

But if the limit is not set (i.e. max), and several groups of processes want access to the CPU at the same time, then the kernel must decide who should be allocated more CPU time.

To do this, another parameter is set in cgroups – cpu.weight (in cgroups v2) or cpu.shares (in cgroups v1): this is the relative priority of a group of processes when determining the CPU access queue.

The value of cpu.weight is taken into account by Linux CFS (Completely Fair Scheduler) to allocate CPU time proportionally among several cgroups – see CFS Scheduler and Process Scheduling in Linux.

$ cat /sys/fs/cgroup/user.slice/user-1000.slice/cpu.weight 
100

The range of values here is from 1 to 10,000, where 1 is the minimum priority and 10,000 is the maximum. The value 100 is the default.

The higher the priority, the more time CFS will allocate to processes in this group.

But this is only taken into account when there is a race for CPU time: when the processor is free, all processes get as much CPU time as they need.
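
A small sketch of how that proportion works (assuming root and two throwaway cgroups, demo-a and demo-b, created only for illustration):

$ sudo mkdir /sys/fs/cgroup/demo-a /sys/fs/cgroup/demo-b
$ echo 100 | sudo tee /sys/fs/cgroup/demo-a/cpu.weight
$ echo 200 | sudo tee /sys/fs/cgroup/demo-b/cpu.weight
# under full contention, CFS splits CPU time proportionally:
# demo-a gets ~100/(100+200) = 1/3, demo-b gets ~200/(100+200) = 2/3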

In Kubernetes, cpu.weight will be determined from resources.requests.cpu.

The value of resources.requests.memory, on the other hand, is used only by the Kubernetes Scheduler to pick a WorkerNode that has enough free memory.

cpu.weight vs process nice

In addition to cpu.weight/cpu.shares, we also have process nice, which sets the priority of the task.

The difference between them is that cpu.weight is set at the cgroup level, while nice is set at the level of a specific process within the same group.

And while a higher cpu.weight value means a higher priority, with nice it is the opposite – the lower the nice value (the range is from -20 to 19), the more CPU time the process will get.

If both processes are in the same cgroup, but with different nice, then nice will be taken into account.

And if these are different cgroups, then cpu.weight will be taken into account.

That is, cpu.weight determines which group of processes is more important to the kernel, and nice determines which process in the group has priority.
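
A short sketch of setting nice for a process (the ./cpu_task binary and PID 12345 are assumptions for the example):

# start a task with a lower priority inside its cgroup
$ nice -n 10 ./cpu_task &
# raise the priority of an already running process (negative values require root)
$ sudo renice -n -5 -p 12345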

Linux cgroups summary

So, each Control Group determines how much CPU and memory will be allocated to its processes:

  • cpu.max: determines how much time from each CPU period a group of processes can spend
    • the resources.limits.cpu and resources.limits.memory values in a Kubernetes manifest set cpu.max and memory.max for the cgroup of the corresponding containers
  • memory.max: how much memory can be used without the risk of being killed by the Out of Memory Killer
    • the resources.requests.memory value in a Kubernetes manifest affects only the Kubernetes Scheduler when it selects a WorkerNode
  • cpu.weight: determines the priority of a group of processes when the CPU is under contention
    • the resources.requests.cpu value in a Kubernetes manifest sets cpu.weight for the cgroup of the corresponding containers

Kubernetes Pod resources and Linux cgroups

Okay, now that we’ve figured out cgroups on Linux, let’s take a closer look at how the values in Kubernetes resources.requests and resources.limits affect containers.

When we set spec.containers.resources in a Deployment or Pod and the Pod is created on a WorkerNode, the kubelet on that node takes the values from the PodSpec and passes them via the Container Runtime Interface (CRI) to the container runtime (containerd or CRI-O).

The runtime converts them into a JSON container specification, which sets the corresponding values for the cgroup of this container.

Kubernetes CPU Unit vs cgroup CPU share

In Kubernetes manifests, we specify CPU resources in CPU units: 1 unit == 1 full CPU core – physical or virtual, see CPU resource units.

1 millicpu, or millicore, is 1/1000 of one CPU core.

One Kubernetes CPU Unit is 1024 CPU shares in the corresponding Linux cgroup.

That is: 1 Kubernetes CPU Unit == 1000 millicpu == 1024 CPU shares in a cgroup.
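
For example, a request of 250m (a quarter of a core) converts to CPU shares like this:

$ echo "250 * 1024 / 1000" | bc
256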

In addition, there is a nuance in how Kubernetes calculates cpu.weight for Pods – Kubernetes uses CPU shares, which it then translates into cpu.weight for cgroups v2 – we will see what that looks like below.

Checking Kubernetes Pod resources in cgroup

Let’s create a test Pod:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
  namespace: default
spec:
  containers:
    - name: nginx
      image: nginx
      resources:
        requests:
          cpu: "1"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "1Gi"
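
Apply it (the nginx-test-pod.yaml file name here is just an assumption – save the manifest under any name):

$ kk apply -f nginx-test-pod.yaml
pod/nginx-test created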

Now let’s find the WorkerNode the Pod was scheduled to:

$ kk describe pod nginx-test
Name:             nginx-test
Namespace:        default
Priority:         0
Service Account:  default
Node:             ip-10-0-32-142.ec2.internal/10.0.32.142
...

Let’s connect via SSH and take a look at the cgroups settings.

Kubernetes kubepods.slice cgroup

All parameters for Kubernetes Pods are set in the /sys/fs/cgroup/kubepods.slice/ directory:

[root@ip-10-0-32-142 ec2-user]# ls -l /sys/fs/cgroup/kubepods.slice/
...
drwxr-xr-x. 5 root root 0 Jul  2 12:30 kubepods-besteffort.slice
drwxr-xr-x. 6 root root 0 Jul  2 12:30 kubepods-burstable.slice
drwxr-xr-x. 4 root root 0 Jul  2 12:31 kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice
...

To find out which cgroup slice is responsible for our container, let’s list the running containers in the k8s.io containerd namespace:

[root@ip-10-0-32-142 ec2-user]# ctr -n k8s.io containers ls
CONTAINER                                                           IMAGE                                                                                             RUNTIME                  
00d432ee10181ce579af7f0d02a3a04167ced45f8438167f3922e385ed9ab58f    602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/eks-pod-identity-agent:v0.1.29                   io.containerd.runc.v2    
...
987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f    docker.io/library/nginx:latest                                                                    io.containerd.runc.v2
...

Note: the namespaces in ctr are containerd namespaces, not Linux ones – see containerd namespaces for Docker, Kubernetes, and beyond.

Our container is “987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f”.

We check all the information on it:

[root@ip-10-0-32-142 ec2-user]# ctr -n k8s.io containers info 987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f
...
        "linux": {
            "resources": {
                "devices": [
                    {
                        "allow": false,
                        "access": "rwm"
                    }
                ],
                "memory": {
                    "limit": 1073741824,
                    "swap": 1073741824
                },
                "cpu": {
                    "shares": 1024,
                    "quota": 100000,
                    "period": 100000
                },
                "unified": {
                    "memory.oom.group": "1",
                    "memory.swap.max": "0"
                }
            },
            "cgroupsPath": "kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice:cri-containerd:987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f",
...

Here we see resources.memory and resources.cpu.

Everything is clear with memory, and in resources.cpu we have three fields:

  • shares: these are our requests from the Pod manifest (PodSpec)
  • quota: these are our limits
  • period: CPU period, which was mentioned above – the “accounting window” for CFS

In cgroupsPath we see which cgroup slice contains information about this container:

[root@ip-10-0-32-142 ec2-user]# ls -l /sys/fs/cgroup/kubepods.slice/kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice/cri-containerd-987bb39fa50532a89842fe1b7a21d1a5829cdf10949a11ac2d4f30ce4afcca2f.scope/
...
-rw-r--r--. 1 root root 0 Jul  2 12:31 cpu.idle
-rw-r--r--. 1 root root 0 Jul  2 12:31 cpu.max
...
-rw-r--r--. 1 root root 0 Jul  2 12:31 cpu.weight
...
-rw-r--r--. 1 root root 0 Jul  2 12:31 memory.max
...

And the corresponding values in them:

[root@ip-10-0-32-142 ec2-user]# cat /sys/fs/cgroup/kubepods.slice/kubepods-pod[...]cca2f.scope/cpu.max
100000 100000

That is, a maximum of 100,000 microseconds out of each 100,000-microsecond window – because we set resources.limits.cpu == “1”, i.e. one full core.
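
For comparison, if we had set resources.limits.cpu: "500m", the same cpu.max file would be expected to contain (an assumption based on the conversion above, not measured on this cluster):

50000 100000

That is, 50 ms out of every 100 ms window.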

Kubernetes, cpu.weight and cgroups v2

But if we take a look at the cpu.weight file, the picture is as follows:

[root@ip-10-0-32-142 ec2-user]# cat /sys/fs/cgroup/kubepods.slice/kubepods-pod[...]cca2f.scope/cpu.weight 
39

Where did the value “39” come from?

In the container description, we saw shares == 1024:

...
        "linux": {
            "resources": {
                ...
                "cpu": {
                    "shares": 1024,
...

cpu.shares 1024 is the value we set in Kubernetes when we specified resources.requests.cpu == “1”, because, as mentioned above, “One Kubernetes CPU Unit is 1024 CPU shares”.

That is, with cgroups v1 we would simply have a value of 1024 in the cpu.shares file.

But cgroups v2 is a bit more interesting.

Under the hood, Kubernetes still counts CPU shares in the format 1 core == 1024 shares, which are then translated into the cgroups v2 format.

If we look at the total cpu.weight for the entire kubepods.slice, we will see a value of 76:

[root@ip-10-0-32-142 ec2-user]# cat /sys/fs/cgroup/kubepods.slice/cpu.weight 
76

Where did the “76” come from?

Let’s check the number of cores on this instance:

[root@ip-10-0-32-142 ec2-user]# lscpu | grep -E '^CPU\('
CPU(s):                                  2

The formula for calculating cpu.weight is described in the kubelet code, in cgroup_manager_linux.go#L566:

...
// CpuSharesToCpuWeight maps cgroups v1 CPU shares (range 2..262144)
// to a cgroups v2 cpu.weight value (range 1..10000)
func CpuSharesToCpuWeight(cpuShares uint64) uint64 {
  return uint64((((cpuShares - 2) * 9999) / 262142) + 1)
}
...

With 2 cores == 2048 CPU shares in cgroups v1 terms, we calculate:

$ echo "(((2048 - 2) * 9999) / 262142) + 1" | bc
79

That is, if all the CPU shares of both cores went to Pods, the entire kubepods.slice would be assigned a “weight” of 79.

But we counted all the CPU shares of the node – in reality, part of the CPU is reserved for the system and for components like the kubelet, which is why the actual value is the slightly lower 76.

Kubernetes Quality of Service Classes

See Kubernetes: Evicted pods and Pods Quality of Service, and Pod Quality of Service Classes.

In the directory /sys/fs/cgroup/kubepods.slice/ we have:

  • kubepods-besteffort.slice: BestEffort QoS – when neither requests nor limits are set
  • kubepods-burstable.slice: Burstable QoS – when requests are set but do not match the limits (for example, only requests are specified)
  • kubepods-pod32075da9_3540_4960_8677_e3837e04d69f.slice: Guaranteed QoS – when requests and limits are set and equal to each other; such Pods get their own slice directly under kubepods.slice

Our Pod is exactly Guaranteed QoS:

$ kk describe pod nginx-test | grep QoS
QoS Class:                   Guaranteed

Because we set requests == limits.

And since we requested one full core out of the two available, Kubernetes via cgroups gives our container roughly half of the total cpu.weight available to kubepods.slice: 39 out of 76.

(The numbers do not divide exactly in two because of rounding in the shares-to-weight conversion and because of the CPU reserved for the system, but the proportion is the point.)

That’s why, under CPU contention, Linux CFS will give our container about half of the available CPU time across both cores – the equivalent of one full core.

Useful links