Why cgroups v2 exists
[[cgroups|cgroups v1]] (since 2007) uses a per-controller hierarchy: each
resource (cpu, memory, blkio, net_cls...) gets its own tree under
/sys/fs/cgroup/<controller>/.... A process could sit in a different
cgroup in each hierarchy.
That design caused problems:
- Inconsistent semantics across controllers. cpuset worked one way, memory another.
- Hard to delegate control. A sub-tree held only a subset of the controllers.
- No reliable way to describe "the limits for a process." You had to compute the intersection.
- net_cls/net_prio became obsolete in favor of eBPF.
In 2016 (kernel 4.5), cgroups v2 arrived with a single tree and clean semantics. After about five years of polish, it became the default, or gained stable support, in:
- systemd 247+ (2020-): the upstream default switched from hybrid to unified
- RHEL 9 (2022) as pure v2
- Ubuntu 21.10+ as pure v2
- Kubernetes 1.25+ (v2 support is GA)
- Docker 20.10+ (v2 supported)
Unified hierarchy
A single tree under /sys/fs/cgroup/:
/sys/fs/cgroup/
├── cgroup.controllers        # available controllers
├── cgroup.subtree_control    # which are enabled for children
├── system.slice/
│   ├── cgroup.controllers
│   ├── cpu.weight
│   ├── memory.high
│   ├── memory.max
│   ├── nginx.service/
│   │   ├── cpu.stat
│   │   ├── memory.current
│   │   └── pids.current
│   └── postgresql.service/
│       └── ...
└── user.slice/
    └── user-1000.slice/
        └── session-1.scope/
Every node is a directory. All nodes share the core cgroup.* files; the controller files (cpu.*, memory.*, ...) appear only for controllers enabled in the parent's cgroup.subtree_control.
A process lives in exactly one node of the tree (through
cgroup.procs).
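The "exactly one node" rule shows up in /proc/<pid>/cgroup: the v2 membership is the single line with hierarchy ID 0 and an empty controller list (0::<path>). Below is a minimal sketch that extracts that path. The function name cgroup_v2_path and its second argument (an alternative proc root, handy for testing) are illustrative assumptions, not part of any tool.

```shell
# Print the cgroup v2 path of a process.
# $1: PID (default: the current shell)
# $2: proc root (default: /proc); hypothetical parameter, useful for testing
cgroup_v2_path() {
    pid="${1:-$$}"
    proc_root="${2:-/proc}"
    # The v2 entry has hierarchy ID 0 and no controller names: "0::<path>".
    # v1 entries look like "12:memory:/foo" and are skipped.
    awk -F: '$1 == "0" && $2 == "" { print substr($0, 4); found = 1 }
             END { exit !found }' "$proc_root/$pid/cgroup"
}
```

Calling cgroup_v2_path 1 on a pure v2 host would typically print the cgroup of PID 1. The function returns non-zero when the process has no v2 entry.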
Controllers: what v2 offers
| Controller | v1 equivalent | What it does |
|---|---|---|
| cpu | cpu+cpuacct | weight, max, period: CPU shares and a hard limit |
| cpuset | cpuset | pinning to CPUs and NUMA nodes |
| memory | memory | usage, min, low, high, max, swap limit |
| io | blkio | weight, bandwidth and IOPS limits |
| pids | pids | cap on the number of processes |
| rdma | rdma | InfiniBand resources |
| misc | - | gpu hpu: customizable limits |
| hugetlb | hugetlb | huge page allocations |
v1 controllers that do not exist in v2 (deprecated):
- net_cls, net_prio → moved into [[ebpf-basics|eBPF cgroup-attached programs]]
- freezer → now through the cgroup.freeze file
- devices → through eBPF (BPF_PROG_TYPE_CGROUP_DEVICE)
- cpuacct → folded into cpu
- perf_event → no control files of its own; enabled implicitly on the v2 hierarchy
To enable a controller in a subtree, write to cgroup.subtree_control:
echo "+memory +pids" > /sys/fs/cgroup/system.slice/cgroup.subtree_control
Children now expose the memory.* and pids.* files.
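Before writing to cgroup.subtree_control it can help to see which controllers a cgroup offers but has not yet passed down. The sketch below compares the two files named above. The function name disabled_controllers is an illustrative assumption.

```shell
# List controllers that a cgroup offers (cgroup.controllers) but has not
# enabled for its children (cgroup.subtree_control).
# $1: cgroup directory, e.g. /sys/fs/cgroup/system.slice
disabled_controllers() {
    cg="$1"
    enabled=" $(cat "$cg/cgroup.subtree_control") "
    for c in $(cat "$cg/cgroup.controllers"); do
        case "$enabled" in
            *" $c "*) ;;                 # already enabled for children
            *) printf '%s\n' "$c" ;;     # available, but not enabled
        esac
    done
}
```

Each name it prints can be enabled by writing +<name> to the same directory's cgroup.subtree_control.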
Memory controller: new fields
Memory control in v2 is more flexible:
| File | Semantics |
|---|---|
| memory.current | current usage, bytes |
| memory.min | reserved, never reclaimed (if there is any way to avoid it) |
| memory.low | best-effort protected, reclaimed only under pressure |
| memory.high | soft limit: the kernel starts throttling and reclaiming |
| memory.max | hard limit: exceeding it means OOM inside this cgroup |
| memory.swap.max | a separate swap limit |
| memory.events | counters: low, high, max, oom, oom_kill |
| memory.stat | detailed breakdown (anon, file, slab, sock, ...) |
memory.high is the main new knob. Unlike the v1 hard limit:
- When it is exceeded, the kernel slows the cgroup down (it sleeps on page faults) and actively reclaims page cache, but does not kill processes.
- Exceeding memory.high alone never triggers the OOM killer. It fires only once usage reaches memory.max, or when the host as a whole runs out of memory.
This gives you graceful backpressure rather than an OOM-kill loop.
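To see whether memory.high is doing its job, compare usage against it and watch the counters in memory.events. The sketch below reads the files from the table above. The function names memory_event and memory_high_pct are illustrative assumptions, and the percentage helper assumes memory.high is either max or a non-zero byte count.

```shell
# Print one counter from a cgroup's memory.events file.
# $1: cgroup directory; $2: counter name (low, high, max, oom, oom_kill)
memory_event() {
    awk -v key="$2" '$1 == key { print $2; found = 1 }
                     END { exit !found }' "$1/memory.events"
}

# Print memory.current as a percentage of memory.high,
# or "unlimited" when memory.high is "max".
# $1: cgroup directory
memory_high_pct() {
    high=$(cat "$1/memory.high")
    if [ "$high" = "max" ]; then
        echo unlimited
        return 0
    fi
    awk -v cur="$(cat "$1/memory.current")" -v high="$high" \
        'BEGIN { printf "%.1f\n", 100 * cur / high }'
}
```

A rising high counter while oom_kill stays at 0 suggests the cgroup is being throttled rather than killed.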
CPU controller: weight and max
cpu.weight # 1-10000 (default 100), relative weight
cpu.max # "<quota> <period>" in microseconds, e.g. "100000 100000" = 1 CPU; "max 100000" = no limit
cpu.stat # usage_usec, user_usec, system_usec, throttled_usec
v1 had two controllers: cpu (weight) and cpuacct (statistics). v2 merges them.
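The quota/period pair in cpu.max is easier to reason about as a CPU count. A minimal sketch of that conversion follows; the function name cpu_max_to_cpus is an illustrative assumption.

```shell
# Convert the contents of cpu.max ("<quota> <period>") into a CPU count.
# Prints "unlimited" when the quota is "max".
# $1: file contents, e.g. "$(cat /sys/fs/cgroup/system.slice/cpu.max)"
cpu_max_to_cpus() {
    printf '%s\n' "$1" | awk '{
        if ($1 == "max") print "unlimited"
        else printf "%.2f\n", $1 / $2     # quota divided by period
    }'
}

# cpu_max_to_cpus "100000 100000"   prints 1.00
# cpu_max_to_cpus "250000 100000"   prints 2.50
# cpu_max_to_cpus "max 100000"      prints unlimited
```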
I/O controller: weight, max, cost-based
io.weight # default 100, relative
io.max # rbps/wbps/riops/wiops limits per device
io.stat # counters per device
io.cost.qos # cost-based QoS (RHEL 9+)
io.cost (new) does not set limits in IOPS or bps. It uses a "cost budget" that accounts for device performance characteristics. It adapts better to different disks (NVMe vs HDD).
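io.stat holds one line per device, keyed by MAJ:MIN and followed by key=value counters such as rbytes, wbytes, rios, and wios. The sketch below pulls out a single counter. The function name io_stat_field is an illustrative assumption.

```shell
# Print one counter for one device from a cgroup's io.stat file.
# $1: cgroup directory
# $2: device as MAJ:MIN, e.g. 8:0
# $3: counter name, e.g. rbytes, wbytes, rios, wios
io_stat_field() {
    awk -v dev="$2" -v key="$3" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) { print kv[2]; found = 1 }
            }
        }
        END { exit !found }' "$1/io.stat"
}
```

Sampling the same counter twice and dividing the difference by the interval gives a throughput figure to compare against the limits in io.max.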
PSI: Pressure Stall Information
In v1 you could not answer: "is the system, or this specific cgroup, overloaded?" Load average was a single host-wide number, and imprecise.
PSI (Linux 4.20+) exposes three files system-wide, plus matching cpu.pressure, memory.pressure, and io.pressure files in every cgroup:
- /proc/pressure/cpu for CPU pressure
- /proc/pressure/memory for memory
- /proc/pressure/io for I/O
The format:
cat /sys/fs/cgroup/system.slice/postgresql.service/io.pressure
some avg10=12.34 avg60=8.91 avg300=5.12 total=12345678
full avg10=3.45 avg60=2.10 avg300=1.05 total=3456789
- some: at least one task in the cgroup is waiting for the resource
- full: all tasks are waiting (the cgroup is fully stalled)
- avg10/60/300: percent of time over the last 10/60/300 seconds
- total: total microseconds
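Because every PSI file uses this same two-line layout, one small parser covers CPU, memory, and I/O, both system-wide and per cgroup. A minimal sketch follows; the function name psi_value is an illustrative assumption.

```shell
# Print one value from a PSI file.
# $1: pressure file, e.g. /proc/pressure/io or <cgroup>/io.pressure
# $2: line kind: some or full
# $3: key: avg10, avg60, avg300, or total
psi_value() {
    awk -v kind="$2" -v key="$3" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) { print kv[2]; found = 1 }
            }
        }
        END { exit !found }' "$1"
}
```

Run against the sample output above, psi_value <file> some avg10 prints 12.34.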
Where you use it:
- Auto-scaling Kubernetes on PSI instead of CPU%
- Out-of-memory prediction. If memory.pressure full goes above 50%, an OOM is imminent.
- systemd-oomd uses PSI for proactive killing well before the [[oom-killer|OOM]] would act.
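The OOM-prediction idea can be tried with a few lines of shell before reaching for a full monitoring stack. This sketch returns success when the full avg10 value in a memory.pressure file is above a threshold. The function name memory_pressure_alarm is an illustrative assumption, and the default of 50 is simply the figure quoted in the list, not a tuned value.

```shell
# Succeed (exit 0) when "full avg10" in a memory.pressure file is above a
# threshold, fail (exit 1) otherwise.
# $1: pressure file, e.g. /sys/fs/cgroup/system.slice/memory.pressure
# $2: threshold in percent (default: 50)
memory_pressure_alarm() {
    awk -v limit="${2:-50}" '
        $1 == "full" {
            sub(/^avg10=/, "", $2)            # second field is avg10=<value>
            if ($2 + 0 > limit + 0) hit = 1
        }
        END { exit !hit }' "$1"
}
```

A cron job or timer unit could call it as: memory_pressure_alarm <file> 50 && logger "memory pressure high".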
systemd and cgroup delegation
By default, systemd is the only writer in the cgroup tree. It sorts units into top-level slices:
- system.slice holds system services
- user.slice holds user sessions
- machine.slice holds VMs and containers (machinectl)
Each unit gets its own cgroup. You set limits through unit properties:
[Service]
CPUWeight=200
MemoryHigh=512M
MemoryMax=1G
IOWeight=300
TasksMax=200
This is equivalent to enabling the controllers in cgroup.subtree_control and writing the matching files (cpu.weight, memory.high, memory.max, io.weight, pids.max) by hand.
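When checking that a unit property landed in the cgroup file, remember that systemd parses the K, M, G, and T suffixes as powers of 1024, while the cgroup files hold plain byte counts. The sketch below does that conversion. The function name size_to_bytes is an illustrative assumption, and it handles only integer values with those four suffixes (not infinity or percentages).

```shell
# Convert a systemd-style size such as 512M or 1G to bytes.
# Suffixes K, M, G, T are powers of 1024; a bare number is taken as bytes.
size_to_bytes() {
    case "$1" in
        *K) mult=1024 ;;
        *M) mult=1048576 ;;
        *G) mult=1073741824 ;;
        *T) mult=1099511627776 ;;
        *)  mult=1 ;;
    esac
    num="${1%[KMGT]}"        # strip the suffix, if any
    echo $((num * mult))
}

# size_to_bytes 512M   prints 536870912
# size_to_bytes 1G     prints 1073741824
```

With the unit above, cat on the service's memory.high would be expected to match size_to_bytes 512M.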
To delegate a sub-tree to another process (Docker, k8s):
Delegate=yes
The process can then create sub-cgroups and change limits within its own sub-tree. This is what containers use.
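A delegated manager usually starts by moving itself into a leaf, because a non-root v2 cgroup cannot hand resource controllers to its children while it still holds processes of its own. The sketch below shows that order of operations. The function name and the manager and workload directory names are hypothetical, and the controllers written must already be listed in the delegated cgroup's cgroup.controllers.

```shell
# Prepare a delegated subtree: create two children, move the manager process
# into one of them, then enable controllers for the children.
# $1: delegated cgroup directory (the unit's own cgroup)
# $2: PID of the manager process to move out of the way
setup_delegated_subtree() {
    base="$1"
    pid="$2"
    mkdir -p "$base/manager" "$base/workload"
    # Step 1: empty the parent. It must hold no processes before step 2.
    echo "$pid" > "$base/manager/cgroup.procs"
    # Step 2: pass controllers down to manager/ and workload/.
    echo "+memory +pids" > "$base/cgroup.subtree_control"
}
```

After this, limits such as memory.max can be written under workload/ without touching anything outside the delegated subtree.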
eBPF plus cgroup: programmable control
v2 integrates with [[ebpf-basics|eBPF]] through cgroup-attached programs:
- BPF_PROG_TYPE_CGROUP_SKB: L3/L4 filtering per cgroup
- BPF_PROG_TYPE_CGROUP_SOCK: control at socket creation
- BPF_PROG_TYPE_CGROUP_SOCKOPT: intercept setsockopt/getsockopt
- BPF_PROG_TYPE_CGROUP_DEVICE: device whitelist
- BPF_PROG_TYPE_LSM with the BPF_LSM_CGROUP attach type: LSM hook per cgroup
This replaces the old v1 controllers (devices, net_cls/net_prio).
Cilium uses cgroup-eBPF for per-pod service routing.
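You load and pin the program first, then attach it through bpftool cgroup attach:
bpftool prog load my_prog.bpf.o /sys/fs/bpf/my_prog   # the pin path is an example
bpftool cgroup attach /sys/fs/cgroup/system.slice/myapp.service \
egress pinned /sys/fs/bpf/my_prog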
v1 vs v2: comparison
| Property | v1 | v2 |
|---|---|---|
| Tree | per-controller | unified |
| Process in several? | yes (one per controller) | no |
| Threaded mode | no | yes (cgroup.type = threaded) |
| Soft memory limit | yes (memory.soft_limit_in_bytes) | no (use memory.high) |
| OOM behavior | OOM in the cgroup | plus memory.high throttling |
| PSI | no | yes |
| eBPF integration | minimal | first-class |
| Default in distros (2025) | RHEL 7-8, Ubuntu ≤ 21.04 | RHEL 9, Ubuntu 21.10+, everything new |
Hybrid mode: the transition
systemd can run in hybrid mode: v1 for legacy controllers (cpuset,
freezer) and v2 for the new ones. The file
/sys/fs/cgroup/cgroup.controllers shows only the controllers in the
v2 tree. This mode is the default in Ubuntu 20.04 and RHEL 8.
With systemd.unified_cgroup_hierarchy=1 (kernel cmdline), you get pure v2.
Check it:
stat -fc %T /sys/fs/cgroup/
# cgroup2fs = pure v2
# tmpfs = v1, either legacy or hybrid v1+v2
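The filesystem type alone does not separate hybrid from pure v1, since both mount a tmpfs at /sys/fs/cgroup. In hybrid mode systemd additionally mounts the v2 tree at /sys/fs/cgroup/unified, so the sketch below checks for that directory. The function name cgroup_mode_from and the labels it prints are illustrative assumptions.

```shell
# Classify the cgroup setup.
# $1: filesystem type as printed by: stat -fc %T <mount point>
# $2: the mount point itself (normally /sys/fs/cgroup)
cgroup_mode_from() {
    case "$1" in
        cgroup2fs) echo unified ;;
        tmpfs)
            # Hybrid setups carry the v2 tree in a "unified" subdirectory.
            if [ -d "$2/unified" ]; then echo hybrid; else echo legacy; fi ;;
        *) echo unknown ;;
    esac
}

# Usage:
# cgroup_mode_from "$(stat -fc %T /sys/fs/cgroup)" /sys/fs/cgroup
```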
When something goes wrong
- cgroup.subtree_control is empty. No controllers are enabled, so you cannot create a child with *.max files. Enable them in the parent first: echo "+cpu +memory" > cgroup.subtree_control.
- No space left on device when creating a cgroup. This is kernel.threads-max or pids.max in the parent. Check both.
- The container ignores the limits. An old container runtime that only knows v1. crun/containerd >= 1.5 and runc >= 1.0 know v2.
- systemd-oomd kills the wrong thing. Check oomd.conf. By default it acts on 60% memory pressure (DefaultMemoryPressureLimit=).
- cpu.weight=1000 gives no priority. Other cgroups in the parent raised their weight too. This is relative scheduling, so check the siblings.
- PSI is zero everywhere. kernel < 4.20, or PSI is not enabled (CONFIG_PSI=y). It may be disabled on the cmdline with psi=0.
- kubelet complains about the cgroup driver. A mismatch between kubelet and the container runtime (cgroupfs vs systemd). Set both to systemd.
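The cpu.weight case is easiest to see with arithmetic: under full contention a cgroup's share is its weight divided by the sum of all weights at the same level. The sketch below computes that figure. The function name weight_share_pct is an illustrative assumption, and the result only holds while every sibling is busy, since idle siblings give up their share.

```shell
# Expected CPU share (percent) of one cgroup when all siblings are busy.
# $1: cpu.weight of the cgroup of interest
# remaining arguments: cpu.weight of each sibling
weight_share_pct() {
    mine="$1"
    shift
    total="$mine"
    for w in "$@"; do
        total=$((total + w))
    done
    awk -v m="$mine" -v t="$total" 'BEGIN { printf "%.1f\n", 100 * m / t }'
}

# weight_share_pct 1000 100 100      prints 83.3
# weight_share_pct 1000 1000 1000    prints 33.3
```

The second call is the trap from the list: a weight of 1000 buys nothing once the siblings are raised to 1000 as well.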
Useful commands and files
- systemd-cgls: the cgroup tree with units
- systemd-cgtop: top by cgroup resource usage
- systemctl set-property nginx.service MemoryMax=1G: a runtime change
- cat /sys/fs/cgroup/.../cgroup.procs: which PIDs are in the cgroup
- cat /proc/<pid>/cgroup: which cgroup a process is in
- cgroup.events: state notifications (populated, frozen); the low, high, max, and oom counters live in memory.events