Copyright © 2026 LinuxLab. All rights reserved.

kb/processes ── Processes & resources ── advanced

cgroups v2: unified hierarchy, PSI, eBPF control

cgroups v2 uses one tree instead of separate per-controller hierarchies. It brings cleaner semantics and new knobs (memory.high, io.cost). PSI shows resource pressure per cgroup, and eBPF programs attached to a cgroup give programmable control. It is the default in RHEL 9 and Ubuntu 21.10+.

aka: cgroupv2, cgroups-v2, unified-hierarchy, psi, pressure-stall-info

Why cgroups v2 exists

[[cgroups|cgroups v1]] (since 2007) uses a per-controller hierarchy: each resource (cpu, memory, blkio, net_cls...) gets its own tree under /sys/fs/cgroup/<controller>/.... A process could sit in a different cgroup in each hierarchy.

That design caused problems:

  • Inconsistent semantics across controllers. cpuset worked one way, memory another.
  • Hard to delegate control. A sub-tree held only a subset of the controllers.
  • No reliable way to describe "the limits for a process." You had to compute the intersection.
  • net_cls/net_prio became obsolete in favor of eBPF.

In 2016 (kernel 4.5), cgroups v2 arrived with a single tree and clean semantics. After five years of polish, cgroups v2 became the default, or at least supported, in:

  • systemd 247+ (2020-) in hybrid mode
  • RHEL 9 (2022) as pure v2
  • Ubuntu 21.10+ as pure v2
  • Kubernetes 1.25+ with support
  • Docker 20.10+ with support

Unified hierarchy

A single tree under /sys/fs/cgroup/:

/sys/fs/cgroup/
├── cgroup.controllers          # available controllers
├── cgroup.subtree_control      # which are enabled for children
├── system.slice/
│   ├── cgroup.controllers
│   ├── cpu.weight
│   ├── memory.high
│   ├── memory.max
│   ├── nginx.service/
│   │   ├── cpu.stat
│   │   ├── memory.current
│   │   └── pids.current
│   └── postgresql.service/
│       └── ...
└── user.slice/
    └── user-1000.slice/
        └── session-1.scope/

Every node is a directory with the same core cgroup.* files. Which controller files (cpu.*, memory.*, ...) appear depends on the controllers the parent enabled.

A process lives in exactly one node of the tree (through cgroup.procs).
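
The membership can be read back from /proc. A minimal sketch, assuming the usual "0::/path" form of the v2 entry in /proc/<pid>/cgroup; the helper name cgroup_v2_path is made up for this example:

```bash
# Print the cgroup v2 path recorded in a /proc/<pid>/cgroup-style file.
# The v2 entry has hierarchy ID 0 and an empty controller list: "0::/path".
# $1: file to read (defaults to the calling process's own entry)
cgroup_v2_path() {
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}" | head -n 1
}

# Guarded demo: only runs where /proc/self/cgroup is readable.
if [ -r /proc/self/cgroup ]; then
    echo "v2 cgroup of this shell: $(cgroup_v2_path)"
fi
```

On a pure v1 host the file has no "0::" line, so the helper prints nothing.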

Controllers: what v2 offers

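| Controller | v1 equivalent | What it does |
|---|---|---|
| cpu | cpu + cpuacct | weight, max (quota and period): relative CPU shares and a hard limit |
| cpuset | cpuset | pinning to CPUs and NUMA nodes |
| memory | memory | usage accounting; min, low, high, and max limits |
| io | blkio | weight, bandwidth and IOPS limits |
| pids | pids | cap on the number of processes |
| rdma | rdma | RDMA/InfiniBand resources |
| misc | - | customizable limits for miscellaneous scalar resources |
| hugetlb | hugetlb | huge page allocations |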

v1 controllers that do not exist in v2 (deprecated):

  • net_cls, net_prio → moved into [[ebpf-basics|eBPF cgroup-attached programs]]
  • freezer → now through the cgroup.freeze file
  • devices → through eBPF (BPF_PROG_TYPE_CGROUP_DEVICE)
  • cpuacct → folded into cpu
  • perf_event → no interface files of its own; enabled automatically on the v2 hierarchy

To enable a controller in a subtree, write to cgroup.subtree_control:

echo "+memory +pids" > /sys/fs/cgroup/system.slice/cgroup.subtree_control

Children now expose the memory.* and pids.* files.
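
Before writing to cgroup.subtree_control it helps to see which controllers are available but not yet passed down. A small sketch under the assumption that both files hold space-separated controller names; the function name disabled_for_children is hypothetical:

```bash
# List controllers that a cgroup has available but has not enabled for its children.
# $1: cgroup directory (defaults to the v2 root)
disabled_for_children() {
    cg="${1:-/sys/fs/cgroup}"
    enabled=" $(cat "$cg/cgroup.subtree_control") "
    for c in $(cat "$cg/cgroup.controllers"); do
        case "$enabled" in
            *" $c "*) ;;                  # already enabled for children
            *) printf '%s\n' "$c" ;;      # available, but not passed down
        esac
    done
}

# Guarded demo: only runs on a host with a v2 tree mounted at /sys/fs/cgroup.
if [ -r /sys/fs/cgroup/cgroup.controllers ]; then
    disabled_for_children /sys/fs/cgroup
fi
```

Each name it prints can be enabled by writing "+<name>" to the same cgroup's cgroup.subtree_control.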

Memory controller: new fields

Memory control in v2 is more flexible:

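| File | Semantics |
|---|---|
| memory.current | current usage, in bytes |
| memory.min | hard protection: memory below this is never reclaimed; if nothing else can be reclaimed, the OOM killer runs |
| memory.low | best-effort protection: memory below this is reclaimed only when no unprotected memory is left |
| memory.high | soft (throttle) limit: above it the kernel throttles the cgroup and reclaims aggressively |
| memory.max | hard limit: if usage cannot be reclaimed below it, the OOM killer runs inside this cgroup |
| memory.swap.max | a separate swap limit |
| memory.events | event counters: low, high, max, oom, oom_kill |
| memory.stat | detailed breakdown (anon, file, slab, sock, ...) |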

memory.high is the main new knob. Unlike the v1 hard limit:

  • When it is exceeded, the kernel slows the cgroup down (its tasks are throttled and pushed into direct reclaim) and reclaims aggressively, but does not kill processes.
  • Exceeding memory.high never invokes the OOM killer (oom-killer) by itself. That happens only when usage reaches memory.max and reclaim cannot bring it back down.

This gives you graceful backpressure rather than an OOM-kill loop.
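
Whether a cgroup is actually being throttled shows up in memory.events. A minimal sketch, assuming the "key value" line format of that file; memory_event is a made-up helper, and the nginx.service path is only the example cgroup from the tree above:

```bash
# Print one counter from a memory.events file (one "key value" pair per line).
# $1: path to memory.events, $2: counter name (low, high, max, oom, oom_kill)
memory_event() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

# Guarded demo against an example cgroup; skipped if the file is absent.
events=/sys/fs/cgroup/system.slice/nginx.service/memory.events
if [ -r "$events" ]; then
    echo "times throttled at memory.high: $(memory_event "$events" high)"
    echo "OOM kills in this cgroup:       $(memory_event "$events" oom_kill)"
fi
```

A growing high counter with oom_kill staying at zero is the backpressure case: the cgroup is being slowed down, not killed.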

CPU controller: weight and max

cpu.weight       # 1-10000 (default 100), relative weight
cpu.max          # "<quota> <period>" in microseconds; "100000 100000" = 1 CPU, "max 100000" = no limit
cpu.stat         # usage_usec, user_usec, system_usec, nr_throttled, throttled_usec

v1 split this across two controllers: cpu (shares and quota) and cpuacct (statistics). v2 merges them into cpu.
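
The cpu.max value is easier to reason about as a CPU count: quota divided by period. A short sketch of that arithmetic, assuming the "<quota> <period>" format in microseconds with the literal "max" meaning no limit; cpu_max_to_cpus is a hypothetical helper name:

```bash
# Convert a cpu.max value ("<quota> <period>", microseconds) into a CPU count.
# $1: the contents of cpu.max as a single string
cpu_max_to_cpus() {
    echo "$1" | awk '{
        if ($1 == "max") print "unlimited"     # no quota configured
        else printf "%.2f\n", $1 / $2          # quota / period
    }'
}

cpu_max_to_cpus "100000 100000"   # prints 1.00
cpu_max_to_cpus "50000 100000"    # prints 0.50
cpu_max_to_cpus "max 100000"      # prints unlimited
```

On a live system the argument would come from something like "$(cat /sys/fs/cgroup/system.slice/nginx.service/cpu.max)".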

I/O controller: weight, max, cost-based

io.weight        # default 100, relative
io.max           # rbps/wbps/riops/wiops limits per device
io.stat          # counters per device
io.cost.qos      # cost-based QoS (RHEL 9+)

io.cost (new) does not set limits in IOPS or bps. It uses a "cost budget" that accounts for device performance characteristics. It adapts better to different disks (NVMe vs HDD).

PSI: Pressure Stall Information

In v1 you could not answer: "is the system, or this specific cgroup, overloaded?" Load average was a single host-wide number, and imprecise.

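PSI (Linux 4.20+) exposes three host-wide files under /proc/pressure/ and a matching file in every cgroup:

  • /proc/pressure/cpu (per cgroup: cpu.pressure) for CPU pressure
  • /proc/pressure/memory (per cgroup: memory.pressure) for memory
  • /proc/pressure/io (per cgroup: io.pressure) for I/O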

The format:

cat /sys/fs/cgroup/system.slice/postgresql.service/io.pressure
some avg10=12.34 avg60=8.91 avg300=5.12 total=12345678
full avg10=3.45  avg60=2.10 avg300=1.05 total=3456789
  • some: at least one task in the cgroup is stalled waiting for the resource
  • full: all non-idle tasks are stalled at the same time (the cgroup is making no progress)
  • avg10/60/300: percent of wall-clock time stalled over the last 10/60/300 seconds
  • total: cumulative stall time in microseconds
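
For scripts and alerts, the fields can be pulled out of a pressure file with a few lines of awk. A minimal sketch assuming the "some/full key=value ..." layout shown above; psi_value is a made-up helper name:

```bash
# Extract one field from a PSI file.
# $1: PSI file, $2: line kind (some|full), $3: field (avg10|avg60|avg300|total)
psi_value() {
    awk -v kind="$2" -v field="$3" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")             # "avg10=12.34" -> kv[1], kv[2]
                if (kv[1] == field) print kv[2]
            }
        }' "$1"
}

# Guarded demo on the host-wide I/O pressure file.
if [ -r /proc/pressure/io ]; then
    echo "io some avg10: $(psi_value /proc/pressure/io some avg10 2>/dev/null)"
fi
```

The same helper works on a per-cgroup file such as io.pressure or memory.pressure, since the format is identical.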

Where you use it:

  • Auto-scaling Kubernetes on PSI instead of CPU%
  • Out-of-memory prediction. A sustained high memory.pressure full value (for example above 50%) suggests an OOM is close.
  • systemd-oomd uses PSI for proactive killing well before the [[oom-killer|OOM]] would act.

systemd and cgroup delegation

systemd is the only writer in the cgroup tree by default:

  • system.slice holds system services
  • user.slice holds user sessions
  • machine.slice holds VMs and containers (machinectl)

Each unit gets its own cgroup. You set limits through unit properties:

[Service]
CPUWeight=200
MemoryHigh=512M
MemoryMax=1G
IOWeight=300
TasksMax=200

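systemd translates these into cgroup writes: it enables the controllers in the parent's cgroup.subtree_control and sets files such as cpu.weight, memory.high, memory.max, io.weight, and pids.max in the unit's cgroup.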

To delegate a sub-tree to another process (Docker, k8s):

Delegate=yes

The process can then create sub-cgroups and change limits within its own sub-tree. This is what containers use.
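
One way to apply such settings persistently is a drop-in file next to the unit. The sketch below only writes the file; the helper name write_cgroup_dropin, the file name 50-cgroup.conf, and the unit myapp.service are all illustrative, and the values are copied from the unit example above:

```bash
# Write a drop-in that sets limits and delegates the unit's sub-tree.
# $1: unit name, $2: base directory (defaults to /etc/systemd/system)
write_cgroup_dropin() {
    dir="${2:-/etc/systemd/system}/$1.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/50-cgroup.conf" <<'EOF'
[Service]
MemoryHigh=512M
MemoryMax=1G
TasksMax=200
Delegate=yes
EOF
}

# Usage (as root), followed by a reload and restart so systemd applies it:
#   write_cgroup_dropin myapp.service
#   systemctl daemon-reload && systemctl restart myapp.service
```

Passing a second argument writes the drop-in under a different base directory, which is handy for inspecting the result before touching /etc.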

eBPF plus cgroup: programmable control

v2 integrates with [[ebpf-basics|eBPF]] through cgroup-attached programs:

  • BPF_PROG_TYPE_CGROUP_SKB: L3/L4 filtering per cgroup
  • BPF_PROG_TYPE_CGROUP_SOCK: control at socket creation
  • BPF_PROG_TYPE_CGROUP_SOCKOPT: intercept setsockopt/getsockopt
  • BPF_PROG_TYPE_CGROUP_DEVICE: device whitelist
  • BPF_PROG_TYPE_LSM with the BPF_LSM_CGROUP attach type: LSM hook per cgroup

This replaces the old v1 controllers (devices, net_cls/net_prio). Cilium uses cgroup-eBPF for per-pod service routing.

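You load and pin the program first, then attach it with bpftool cgroup attach (the pin path /sys/fs/bpf/my_prog is only an example):

bpftool prog load my_prog.bpf.o /sys/fs/bpf/my_prog
bpftool cgroup attach /sys/fs/cgroup/system.slice/myapp.service \
                     egress pinned /sys/fs/bpf/my_prog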

v1 vs v2: comparison

| Property | v1 | v2 |
|---|---|---|
| Tree | per-controller | unified |
| Process in several cgroups? | yes (one per controller) | no |
| Threaded mode | no | yes (cgroup.type = threaded) |
| Soft memory limit | yes (memory.soft_limit_in_bytes) | no (use memory.high) |
| OOM behavior | OOM in the cgroup | OOM in the cgroup, plus memory.high throttling |
| PSI | no | yes |
| eBPF integration | minimal | first-class |
| Default in distros (2025) | RHEL 7-8, Ubuntu up to 21.04 | RHEL 9, Ubuntu 21.10+, everything new |

Hybrid mode: the transition

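systemd can run in hybrid mode: the resource controllers stay on v1 hierarchies under /sys/fs/cgroup/<controller>/, and a v2 tree is mounted alongside them at /sys/fs/cgroup/unified/, which systemd uses for process tracking. The file /sys/fs/cgroup/unified/cgroup.controllers shows only the controllers in the v2 tree, which is normally none. This mode is the default in Ubuntu 20.04 and RHEL 8.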

With systemd.unified_cgroup_hierarchy=1 (kernel cmdline), you get pure v2.

Check it:

stat -fc %T /sys/fs/cgroup/
# cgroup2fs = pure v2
# tmpfs = v1 hierarchies (legacy or hybrid)
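
The check can be wrapped so that it also tells hybrid apart from plain v1. This is a sketch, assuming systemd's hybrid layout where the v2 tree sits at /sys/fs/cgroup/unified; classify_cgroup_fs is a hypothetical helper name:

```bash
# Classify the cgroup setup from the filesystem type of the mount point.
# $1: filesystem type as printed by `stat -fc %T`
# $2: "yes" if <mount point>/unified exists (assumed marker of hybrid mode)
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "unified" ;;
        tmpfs)
            if [ "$2" = "yes" ]; then echo "hybrid"; else echo "legacy"; fi ;;
        *) echo "unknown" ;;
    esac
}

root=/sys/fs/cgroup
has_unified=no
if [ -d "$root/unified" ]; then has_unified=yes; fi
classify_cgroup_fs "$(stat -fc %T "$root" 2>/dev/null)" "$has_unified"
```

Keeping the classification separate from the stat call makes the logic easy to exercise with made-up inputs.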

When something goes wrong

  • cgroup.subtree_control is empty. No controllers are enabled, so you cannot create a child with *.max files. Enable them in the parent first: echo "+cpu +memory" > cgroup.subtree_control.
  • No space left on device when creating a cgroup. This is kernel.threads-max or pids.max in the parent. Check both.
  • The container ignores the limits. The container runtime may be an old one that only knows v1. crun, containerd >= 1.4, and runc >= 1.0 know v2.
  • systemd-oomd kills the wrong thing. Check oomd.conf. The upstream default acts on 60% memory pressure (DefaultMemoryPressureLimit); distributions may override it.
  • cpu.weight=1000 gives no priority. Other cgroups in the parent raised their weight too. This is relative scheduling, so check the siblings.
  • PSI is zero everywhere. kernel < 4.20, or PSI is not enabled (CONFIG_PSI=y). It may be disabled on the cmdline with psi=0.
  • kubelet complains about the cgroup driver. kubelet and the container runtime use different cgroup drivers (cgroupfs vs systemd). Set both to systemd.
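
For the cpu.weight case, the effective share under full contention is the cgroup's weight divided by the sum of all sibling weights. A rough sketch of that check, assuming every sibling is busy and has the cpu controller enabled; cpu_share_percent is a made-up helper name:

```bash
# Share of CPU a child gets when all siblings compete: weight / sum of sibling weights.
# $1: parent cgroup directory, $2: child cgroup name
cpu_share_percent() {
    total=0
    for f in "$1"/*/cpu.weight; do
        [ -r "$f" ] || continue                 # skip children without the cpu controller
        total=$((total + $(cat "$f")))
    done
    mine=$(cat "$1/$2/cpu.weight")
    awk -v m="$mine" -v t="$total" 'BEGIN { printf "%.1f\n", 100 * m / t }'
}

# Usage on a live system, for example:
#   cpu_share_percent /sys/fs/cgroup/system.slice nginx.service
```

If two siblings both sit at 1000 and a third at 2000, the first two get 25% each. Raising one weight changes nothing once its siblings are raised by the same factor.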

Useful commands and files

  • systemd-cgls: the cgroup tree with units
  • systemd-cgtop: top by cgroup resource usage
  • systemctl set-property nginx.service MemoryMax=1G: a runtime change
  • cat /sys/fs/cgroup/.../cgroup.procs: which PIDs are in the cgroup
  • cat /proc/<pid>/cgroup: which cgroup a process is in
  • cgroup.events: state flags (populated, frozen); the low, high, max, oom, oom_kill counters live in memory.events

§ commands

bash
stat -fc %T /sys/fs/cgroup/

cgroup2fs = pure v2. tmpfs = v1 hierarchies (legacy or hybrid). One command tells you which.

bash
systemd-cgls

The cgroup tree with systemd units and processes

bash
systemd-cgtop

Top by cgroup resource usage (CPU%, memory, I/O)

bash
cat /proc/$$/cgroup

Which cgroup the current shell is in, format '0::/path'

bash
cat /sys/fs/cgroup/system.slice/nginx.service/memory.current

Current memory usage of a specific service

bash
cat /sys/fs/cgroup/system.slice/nginx.service/io.pressure

PSI for I/O: 'some' and 'full' over 10/60/300 seconds

bash
systemctl set-property nginx.service MemoryHigh=512M MemoryMax=1G

Runtime change to a service's limits (persists in /etc/systemd/system.control/)

bash
echo '+memory +io' > /sys/fs/cgroup/<slice>/cgroup.subtree_control

Enable controllers for the children of this cgroup

§ see also

  • [[cgroups|cgroups (v2)]]: cgroups v2 is a hierarchical virtual FS under `/sys/fs/cgroup` that the kernel uses to limit CPU, memory, and I/O for processes. Docker, k8s, and systemd write here.
  • [[kubelet-internals|kubelet: the Kubernetes node agent architecture]]: kubelet is a daemon on every node. It receives the PodSpec through the API, starts containers through CRI, mounts volumes through CSI, and watches health. Under pressure it does eviction. Image GC and the cgroup tree are also its job.
  • [[oom-killer|OOM killer]]: OOM killer is the kernel mechanism that picks and terminates a process when the system hits its memory limit. In containers it works per-cgroup.
  • [[systemd|systemd: the init system and service manager]]: systemd is the Linux init system: PID 1 that starts everything else, tracks dependencies, restarts what crashes, and collects the logs.
  • [[docker-storage-drivers|Docker storage drivers: overlay2, btrfs, zfs]]: A storage driver is how Docker keeps image layers and container changes on disk. overlay2 is the default (overlayfs over ext4/xfs), btrfs and zfs work through subvolumes and snapshots, fuse-overlayfs is for rootless.
  • [[namespaces|Linux namespaces]]: Namespaces are a kernel mechanism that gives a process its own isolated view of a resource (network, mount points, PID, UID, IPC, hostname, time). Every container is built on them.
  • [[pyroscope-continuous-profiling|Continuous profiling: Pyroscope, eBPF, flame graphs in production]]: Continuous profiling is an always-on CPU/memory profiler in production through eBPF. 1-2% overhead. Flame graphs show the hot path. Pyroscope (Grafana), Parca, Polar Signals. It replaces ad-hoc perf for production debugging.