cgroups v2: unified hierarchy, PSI, eBPF control

Why cgroups v2 exists

[[cgroups|cgroups v1]] (since 2007) uses a per-controller hierarchy: each resource (cpu, memory, blkio, net_cls...) gets its own tree under /sys/fs/cgroup/<controller>/.... A process could sit in a different cgroup in each hierarchy.

That design caused problems:

Inconsistent semantics across controllers. cpuset worked one way, memory another.
Hard to delegate control. A sub-tree held only a subset of the controllers.
No reliable way to describe "the limits for a process." You had to compute the intersection.
net_cls/net_prio became obsolete in favor of eBPF.

In 2016 (kernel 4.5), cgroups v2 arrived with a single tree and clean semantics. After about five years of polish, it became the default, or at least fully supported, in:

systemd 247+ (2020), which made the unified hierarchy the build-time default
RHEL 9 (2022) as pure v2
Ubuntu 21.10+ as pure v2
Kubernetes 1.25+ with stable support
Docker 20.10+ with support

Unified hierarchy

A single tree under /sys/fs/cgroup/:

/sys/fs/cgroup/
├── cgroup.controllers          # available controllers
├── cgroup.subtree_control      # which are enabled for children
├── system.slice/
│   ├── cgroup.controllers
│   ├── cpu.weight
│   ├── memory.high
│   ├── memory.max
│   ├── nginx.service/
│   │   ├── cpu.stat
│   │   ├── memory.current
│   │   └── pids.current
│   └── postgresql.service/
│       └── ...
└── user.slice/
    └── user-1000.slice/
        └── session-1.scope/
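Every node is a directory with the same core files (cgroup.procs, cgroup.controllers, cgroup.subtree_control, ...). Controller files such as memory.* appear only where the parent has enabled that controller.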

A process lives in exactly one node of the tree (through cgroup.procs).

Controllers: what v2 offers

Controller	v1 equivalent	What it does
cpu	cpu+cpuacct	weight, max, period: CPU shares and a hard limit
cpuset	cpuset	pinning to CPUs and NUMA nodes
memory	memory	usage accounting, min/low protection, high throttling, max hard limit
io	blkio	weight, per-device bandwidth and IOPS limits
pids	pids	cap on the number of processes
rdma	rdma	InfiniBand resources
misc	-	scalar resources that fit no other controller: customizable limits
hugetlb	hugetlb	huge page allocations

v1 controllers that do not exist in v2 (deprecated):

net_cls, net_prio → moved into [[ebpf-basics|eBPF cgroup-attached programs]]
freezer → now through the cgroup.freeze file
devices → through eBPF (BPF_PROG_TYPE_CGROUP_DEVICE)
cpuacct → folded into cpu
perf_event → always active on the v2 hierarchy, with no separate controller to enable

To enable a controller in a subtree, write to cgroup.subtree_control:

echo "+memory +pids" > /sys/fs/cgroup/system.slice/cgroup.subtree_control

Children now expose the memory.* and pids.* files.
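To see which controllers a cgroup could still hand down to its children, you can compare the two files. The following is a minimal sketch; the function name `disabled_for_children` is made up for this note, and it assumes both files hold a single space-separated line of controller names.

```shell
#!/bin/sh
# List controllers that a cgroup offers but has not enabled for its children.
# disabled_for_children is a hypothetical helper name, not a standard tool.
# $1 = cgroup directory, e.g. /sys/fs/cgroup/system.slice
disabled_for_children() {
    dir=$1
    available=$(cat "$dir/cgroup.controllers")
    enabled=$(cat "$dir/cgroup.subtree_control")
    for c in $available; do
        case " $enabled " in
            *" $c "*) ;;              # already enabled for children
            *) printf '%s\n' "$c" ;;  # available, but not passed down yet
        esac
    done
}
```

Each name it prints can be enabled by writing `+<name>` to cgroup.subtree_control in that directory.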

Memory controller: new fields

Memory control in v2 is more flexible:

File	Semantics
`memory.current`	current usage, bytes
`memory.min`	hard protection: never reclaimed; if nothing else can be reclaimed, the OOM killer is invoked instead
`memory.low`	best-effort protection: reclaimed only when no unprotected memory is left to reclaim
`memory.high`	throttle limit: above it the kernel throttles the cgroup and reclaims aggressively
`memory.max`	hard limit: exceeding it means OOM inside this cgroup
`memory.swap.max`	a separate swap limit
`memory.events`	counters: low, high, max, oom, oom_kill
`memory.stat`	detailed breakdown (anon, file, slab, sock, ...)

memory.high is the main new knob. Unlike the v1 hard limit:

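When it is exceeded, the kernel throttles the cgroup (its tasks are put to sleep as they allocate) and reclaims its memory aggressively, but does not kill processes.
memory.high by itself never invokes the OOM killer (oom-killer). Under sustained pressure, usage can overshoot the threshold, and a kill happens only once memory.max or the whole system runs out of memory.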

This gives you graceful backpressure rather than an OOM-kill loop.
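Whether a cgroup was throttled or killed shows up in the memory.events counters. Here is a small sketch that reads them; `mem_event` and `mem_verdict` are illustrative names, and the code assumes the usual `<name> <count>` line format of memory.events.

```shell
#!/bin/sh
# Print one counter from a memory.events file ("<name> <count>" per line).
# $1 = path to memory.events, $2 = counter name (low, high, max, oom, oom_kill)
mem_event() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}

# Summarize what the cgroup has experienced so far (hypothetical helper).
mem_verdict() {
    high=$(mem_event "$1" high) || return 1
    kills=$(mem_event "$1" oom_kill) || return 1
    if [ "$kills" -gt 0 ]; then
        echo "oom-killed"      # at least one process was killed
    elif [ "$high" -gt 0 ]; then
        echo "throttled"       # memory.high was exceeded, nothing killed
    else
        echo "ok"
    fi
}
```

A cgroup that reports "throttled" for a long time is a candidate for a higher memory.high or a smaller working set.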

CPU controller: weight and max

cpu.weight       # 1-10000 (default 100), relative weight
cpu.max          # "<quota> <period>" in microseconds, e.g. "100000 100000" = 1 CPU; "max 100000" = no limit
cpu.stat         # usage_usec, user_usec, system_usec, throttled_usec

v1 had two controllers: cpu (weight) and cpuacct (statistics). v2 merges them.
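The quota and period in cpu.max translate into a CPU count by plain division. A minimal sketch of that arithmetic, with `cpu_max_to_cpus` as a made-up helper name:

```shell
#!/bin/sh
# Convert a cpu.max value to a number of CPUs.
# $1 = file content, either "<quota> <period>" or "max <period>"
cpu_max_to_cpus() {
    echo "$1" | awk '{
        if ($1 == "max") print "unlimited"   # literal "max" means no quota
        else printf "%.2f\n", $1 / $2        # quota / period = CPUs
    }'
}
```

For instance, `cpu_max_to_cpus "50000 100000"` prints 0.50, which is half a CPU per 100 ms period.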

I/O controller: weight, max, cost-based

io.weight        # 1-10000 (default 100), relative
io.max           # rbps/wbps/riops/wiops limits per device
io.stat          # counters per device
io.cost.qos      # cost-based QoS, set on the root cgroup (RHEL 9+)

io.cost (new) does not set limits in IOPS or bps. It uses a "cost budget" that accounts for device performance characteristics. It adapts better to different disks (NVMe vs HDD).
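An io.max entry names the device by its major:minor number, followed by key=value limits in bytes per second or I/Os per second. The sketch below builds such a line from MiB/s values; `io_max_line` is a hypothetical helper, and it sets only the two bandwidth keys.

```shell
#!/bin/sh
# Build an io.max line that caps read and write bandwidth on one device.
# $1 = device as MAJ:MIN, $2 = read limit in MiB/s, $3 = write limit in MiB/s
io_max_line() {
    printf '%s rbps=%d wbps=%d\n' "$1" \
        "$(( $2 * 1024 * 1024 ))" "$(( $3 * 1024 * 1024 ))"
}
```

The output would be redirected into the io.max file of the target cgroup, for example `io_max_line 8:16 2 1 > "$cg/io.max"`, where `$cg` stands for a cgroup directory of your choosing.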

PSI: Pressure Stall Information

In v1 you could not answer: "is the system, or this specific cgroup, overloaded?" Load average was a single host-wide number, and imprecise.

PSI (Linux 4.20+) exposes three files system-wide under /proc/pressure/, and every cgroup carries the matching cpu.pressure, memory.pressure, and io.pressure files:

/proc/pressure/cpu for CPU pressure
/proc/pressure/memory for memory
/proc/pressure/io for I/O

The format:

cat /sys/fs/cgroup/system.slice/postgresql.service/io.pressure
some avg10=12.34 avg60=8.91 avg300=5.12 total=12345678
full avg10=3.45 avg60=2.10 avg300=1.05 total=3456789

some: at least one task in the cgroup is waiting for the resource
full: all non-idle tasks are waiting at the same time (the cgroup is fully stalled)
avg10/60/300: percent of time stalled over the last 10/60/300 seconds
total: cumulative stall time in microseconds
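Because every line is a kind followed by key=value pairs, one awk filter is enough to pull out a single number. A minimal sketch, with `psi_field` as an illustrative name:

```shell
#!/bin/sh
# Extract one field from PSI output read on stdin.
# $1 = line kind ("some" or "full"), $2 = field (avg10, avg60, avg300, total)
psi_field() {
    awk -v kind="$1" -v field="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")            # "avg10=12.34" -> kv[1], kv[2]
                if (kv[1] == field) print kv[2]
            }
        }'
}
```

Typical use is `psi_field some avg10 < /proc/pressure/io`, or the same against a cgroup's io.pressure file.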

Where you use it:

Auto-scaling Kubernetes on PSI instead of CPU%
Out-of-memory prediction. If memory.pressure full stays above 50%, the cgroup spends most of its time stalled on reclaim and an OOM is likely to follow.
systemd-oomd uses PSI for proactive killing well before the [[oom-killer|OOM]] would act.
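A threshold check of this kind can be sketched in a few lines. The function name `psi_full_over` and the choice of the avg10 window are assumptions made for illustration. The code relies on avg10 being the first field after the line kind, as in the format shown earlier.

```shell
#!/bin/sh
# Exit 0 when the "full avg10" value on stdin is above the limit in $1,
# exit 1 otherwise. Hypothetical helper, not part of any tool.
psi_full_over() {
    awk -v limit="$1" '
        $1 == "full" {
            sub(/^avg10=/, "", $2)            # keep only the number
            if ($2 + 0 > limit + 0) hit = 1
        }
        END { exit !hit }'
}
```

It composes with ordinary shell control flow, for example `psi_full_over 50 < "$cg/memory.pressure" && echo "memory stall"`, where `$cg` is a cgroup directory you supply.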

systemd and cgroup delegation

systemd is the only writer in the cgroup tree by default:

system.slice holds system services
user.slice holds user sessions
machine.slice holds VMs and containers (machinectl)

Each unit gets its own cgroup. You set limits through unit properties:

[Service]
CPUWeight=200
MemoryHigh=512M
MemoryMax=1G
IOWeight=300
TasksMax=200

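systemd translates these into writes to the unit's cgroup files (cpu.weight, memory.high, memory.max, io.weight, pids.max) and enables the required controllers in the parent's cgroup.subtree_control.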
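The mapping from unit properties to cgroup files can be written down as a lookup. This sketch covers only the five properties used in the unit above; `unit_prop_to_cgroup_file` is a name invented for this note.

```shell
#!/bin/sh
# Print the cgroup v2 file that backs a systemd resource-control property.
# Only the properties from the example unit are covered.
unit_prop_to_cgroup_file() {
    case "$1" in
        CPUWeight)  echo cpu.weight ;;
        MemoryHigh) echo memory.high ;;
        MemoryMax)  echo memory.max ;;
        IOWeight)   echo io.weight ;;
        TasksMax)   echo pids.max ;;
        *)          echo "unknown property: $1" >&2; return 1 ;;
    esac
}
```

To check what systemd actually wrote, combine it with the unit's cgroup path: `cat "/sys/fs/cgroup$(systemctl show -p ControlGroup --value nginx.service)/$(unit_prop_to_cgroup_file MemoryMax)"`.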

To delegate a sub-tree to another process (Docker, k8s):

Delegate=yes

The process can then create sub-cgroups and change limits within its own sub-tree. This is what containers use.

eBPF plus cgroup: programmable control

v2 integrates with [[ebpf-basics|eBPF]] through cgroup-attached programs:

BPF_PROG_TYPE_CGROUP_SKB: L3/L4 filtering per cgroup
BPF_PROG_TYPE_CGROUP_SOCK: control at socket creation
BPF_PROG_TYPE_CGROUP_SOCKOPT: intercept setsockopt/getsockopt
BPF_PROG_TYPE_CGROUP_DEVICE: device whitelist
BPF_PROG_TYPE_LSM with the BPF_LSM_CGROUP attach type: LSM hook per cgroup

This replaces the old v1 controllers (devices, net_cls/net_prio). Cilium uses cgroup-eBPF for per-pod service routing.

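You load the program and pin it, then attach the pinned program with bpftool cgroup attach:

bpftool prog load my_prog.bpf.o /sys/fs/bpf/my_prog type cgroup/skb
bpftool cgroup attach /sys/fs/cgroup/system.slice/myapp.service egress pinned /sys/fs/bpf/my_prog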

v1 vs v2: comparison

Property	v1	v2
Tree	per-controller	unified
Process in several?	yes (one per controller)	no
Threaded mode	no	yes (`cgroup.type` = threaded)
Soft memory limit	yes (memory.soft_limit_in_bytes)	no (use memory.high)
OOM behavior	OOM kill at the hard limit	OOM kill at memory.max, preceded by memory.high throttling
PSI	system-wide only	per cgroup
eBPF integration	minimal	first-class
Default in distros (2025)	RHEL 7-8, Ubuntu < 21.10	RHEL 9, Ubuntu 21.10+, everything new

Hybrid mode: the transition

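systemd can run in hybrid mode: the v1 hierarchies keep the resource controllers, and a v2 tree is mounted next to them at /sys/fs/cgroup/unified, which systemd uses to track processes. A controller can be bound to only one hierarchy at a time, so cgroup.controllers in the v2 tree lists only controllers that no v1 hierarchy has claimed. Ubuntu 20.04 boots in this mode by default; RHEL 8 also keeps its controllers on v1 by default.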

With systemd.unified_cgroup_hierarchy=1 (kernel cmdline), you get pure v2.

Check it:

stat -fc %T /sys/fs/cgroup/
# cgroup2fs = pure v2
# tmpfs = v1 or hybrid v1+v2
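Because stat reports tmpfs for both a pure v1 setup and a hybrid one, a file-based check can tell the three layouts apart. The sketch below assumes the systemd convention of mounting the v2 tree at `unified/` in hybrid mode; `cgroup_mode` is a made-up name.

```shell
#!/bin/sh
# Classify the cgroup layout under a mount root (default /sys/fs/cgroup).
cgroup_mode() {
    root=${1:-/sys/fs/cgroup}
    if [ -f "$root/cgroup.controllers" ]; then
        echo "unified"     # the root itself is a cgroup2 filesystem
    elif [ -d "$root/unified" ]; then
        echo "hybrid"      # v1 hierarchies plus a v2 tree at unified/
    elif [ -d "$root" ]; then
        echo "legacy"      # v1 only
    else
        echo "unknown"
    fi
}
```

Called without arguments, it inspects the live system. The optional argument exists mainly so the logic can be tried against a scratch directory.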

When something goes wrong

cgroup.subtree_control is empty. No controllers are enabled, so you cannot create a child with *.max files. Enable them in the parent first: echo "+cpu +memory" > cgroup.subtree_control.
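No space left on device when creating a cgroup. A limit in the parent has been hit. Check kernel.threads-max and pids.max, and also cgroup.max.descendants and cgroup.max.depth, which cap how many cgroups a subtree may contain.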
The container ignores the limits. The container runtime is probably old and only knows v1. crun, containerd >= 1.5, and runc >= 1.0 support v2.
systemd-oomd kills the wrong thing. Check oomd.conf. By default it acts when memory pressure stays above 60% for 30 seconds (DefaultMemoryPressureLimit=, DefaultMemoryPressureDurationSec=).
cpu.weight=1000 gives no priority. Other cgroups in the parent raised their weight too. This is relative scheduling, so check the siblings.
PSI is zero everywhere, or the pressure files are missing. The kernel is older than 4.20, was built without CONFIG_PSI=y, or has PSI turned off on the cmdline with psi=0.
kubelet complains about the cgroup driver. The kubelet and the container runtime disagree on the driver (cgroupfs vs systemd). Set both to systemd.
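For the cpu.weight case above, the share a cgroup gets under contention is its weight divided by the sum of the weights of all busy siblings, itself included. A small sketch of that arithmetic, with `cpu_share_percent` as an illustrative name:

```shell
#!/bin/sh
# Expected CPU share under full contention, in percent.
# $1 = this cgroup's cpu.weight, remaining args = weights of its busy siblings
cpu_share_percent() {
    own=$1
    total=0
    for w in "$@"; do               # "$@" includes the cgroup's own weight
        total=$(( total + w ))
    done
    awk -v own="$own" -v total="$total" \
        'BEGIN { printf "%.1f\n", 100 * own / total }'
}
```

With two siblings at the default 100, a weight of 1000 yields 83.3 percent. If both siblings were also raised to 1000, the same weight yields only 33.3 percent.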

Useful commands and files

systemd-cgls: the cgroup tree with units
systemd-cgtop: top by cgroup resource usage
systemctl set-property nginx.service MemoryMax=1G: a runtime change
cat /sys/fs/cgroup/.../cgroup.procs: which PIDs are in the cgroup
cat /proc/<pid>/cgroup: which cgroup a process is in
cgroup.events: state flags (populated, frozen); the low, high, max, oom counters live in memory.events
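On a v2 system, /proc/<pid>/cgroup contains a line of the form `0::<path>`; v1 hierarchies, if any, appear as extra lines with a non-zero ID. A one-line filter extracts the v2 path. `v2_cgroup_path` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the cgroup v2 path from /proc/<pid>/cgroup content read on stdin.
# The v2 entry has hierarchy ID 0 and an empty controller list: "0::<path>".
v2_cgroup_path() {
    sed -n 's/^0:://p'
}
```

For the current shell, `v2_cgroup_path < /proc/self/cgroup` prints its path relative to /sys/fs/cgroup.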

