Why
Historically, root was all-powerful and an ordinary user could perform no privileged operation at all. That model is all-or-nothing: to run ping you need a raw socket, for a raw socket you need root, and once the program runs as root it can do EVERYTHING.
Capabilities split that "omnipotence" into about 40 bits (41 as of Linux 5.9),
each one a specific right. A program needs only CAP_NET_RAW to create a raw
ICMP socket, not the other bits. In containers this matters a lot: --cap-drop=ALL
plus --cap-add=NET_BIND_SERVICE leaves a root process that can do little more than bind low ports.
The most common capabilities
| CAP | What it allows |
|---|---|
| NET_ADMIN | ip, tc, iptables, nft, network sysctls |
| NET_RAW | raw sockets (ping, tcpdump) |
| NET_BIND_SERVICE | bind to ports <1024 (80, 443, 22) |
| SYS_ADMIN | mount, umount, swapon, and a pile of other things, almost like root |
| SYS_PTRACE | ptrace other processes (gdb, strace for other users) |
| SYS_TIME | change the system clock |
| SYS_NICE | priority, real-time scheduling |
| SYS_RESOURCE | raise ulimit values |
| CHOWN | change file ownership |
| DAC_OVERRIDE | bypass file permission checks for read, write, and execute |
| DAC_READ_SEARCH | bypass permission checks for reading files and for reading/searching directories |
| SETUID / SETGID | change the process UID/GID |
| KILL | send signals to any process |
| MKNOD | create device nodes |
| AUDIT_WRITE | write to the audit log (needed for login) |
| BPF | load eBPF programs (since Linux 5.8) |
| PERFMON | performance monitoring (perf_event_open) without SYS_ADMIN (since Linux 5.8) |
For the full list, read man 7 capabilities or run capsh --print.
Where a process keeps its capabilities
A process holds 5 sets of capabilities (as bitmasks in task_struct):
- Permitted (P): what it MAY acquire
- Effective (E): what is ACTIVE right now
- Inheritable (I): what can pass across exec, but only if the new binary's file inheritable set also contains it
- Bounding (B): the ceiling. Not even escalation can cross this set.
- Ambient (A): kept across exec of a binary that has neither setuid nor file capabilities (since Linux 4.3, for non-root users)
cat /proc/self/status | grep ^Cap
# CapInh: 0000000000000000
# CapPrm: 0000003fffffffff ← 38 bits set (caps 0-37) = root; kernels 5.9+ show 000001ffffffffff
# CapEff: 0000003fffffffff
# CapBnd: 0000003fffffffff
# CapAmb: 0000000000000000
capsh --print # readable format
capsh --decode=0000003fffffffff
A plain user has all zeros in CapPrm and CapEff (CapBnd usually stays full).
Root has every defined bit set. A container started with --cap-drop=ALL plus
a few --cap-add shows a specific bitmask with only those bits set.
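The mask can also be decoded by hand, since capability number n is simply bit n. Below is a minimal POSIX shell sketch; cap_bits is a helper name made up for this example, and the numbers it prints are the ones from linux/capability.h (10 is CAP_NET_BIND_SERVICE, 13 is CAP_NET_RAW).

```sh
# cap_bits MASK: print the capability numbers set in a hex mask.
# "cap_bits" is a hypothetical helper, not part of libcap.
# Runs in a subshell (parentheses) so its variables do not leak.
cap_bits() (
    mask=$(( 0x$1 ))      # prefix with 0x so the value is parsed as hex
    bit=0
    out=""
    # Capability masks never use bit 63, so the value stays non-negative
    # and the loop ends once all set bits have been shifted out.
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="$out $bit"
        fi
        mask=$(( mask >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "${out# }"
)

cap_bits 0000000000002400   # prints: 10 13
```

For real work capsh --decode is the better tool because it prints names; the loop only shows what that decoding amounts to.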
File capabilities
A binary can carry capabilities in an extended attribute (security.capability). At exec the process picks them up without the binary being setuid:
# Give /usr/bin/ping the CAP_NET_RAW right (how many distros ship it)
sudo setcap cap_net_raw+ep /usr/bin/ping
getcap /usr/bin/ping
# /usr/bin/ping = cap_net_raw+ep   (newer libcap prints: /usr/bin/ping cap_net_raw=ep)
# Remove
sudo setcap -r /usr/bin/ping
This is safer than SUID-root: the binary gets only the bit it needs, not full root privileges.
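What happens at exec can be written as bit arithmetic. man 7 capabilities gives the rule P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & P(bounding)) | P'(ambient), and P'(effective) equals P'(permitted) when the file's effective flag is set. The sketch below plugs in an unprivileged user running ping with cap_net_raw+ep; the variable names are illustrative and the starting values are assumptions, not captured output.

```sh
# Process sets before exec: an unprivileged user with a full bounding set.
p_inh=0x0
p_bnd=0x3fffffffff

# File sets for ping after "setcap cap_net_raw+ep": CAP_NET_RAW is bit 13.
f_inh=0x0
f_prm=$(( 1 << 13 ))
f_eff=1                 # the "e" flag is a single on/off bit on the file

# A file that carries capabilities counts as privileged, so the ambient
# set is cleared at exec.
new_amb=0

# Rule from man 7 capabilities for the new permitted set.
new_prm=$(( ($p_inh & $f_inh) | ($f_prm & $p_bnd) | $new_amb ))

# With the effective flag set, the new effective set equals new permitted.
if [ "$f_eff" -eq 1 ]; then
    new_eff=$new_prm
else
    new_eff=$new_amb
fi

printf 'CapPrm: %016x\nCapEff: %016x\n' "$new_prm" "$new_eff"
# CapPrm: 0000000000002000
# CapEff: 0000000000002000
```

The bounding set is what makes --cap-drop effective against file capabilities: if bit 13 is missing from p_bnd, the F(permitted) & P(bounding) term is zero and ping gets nothing.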
In Docker / containers
Docker gives a limited default set of caps out of the box:
CAP_AUDIT_WRITE, CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID,
CAP_KILL, CAP_MKNOD, CAP_NET_BIND_SERVICE, CAP_NET_RAW, CAP_SETFCAP,
CAP_SETGID, CAP_SETPCAP, CAP_SETUID, CAP_SYS_CHROOT
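To recognise this default set in /proc/<pid>/status, it helps to know its bitmask. The sketch below ORs together the bit numbers of the capabilities listed above; the numbers are taken from linux/capability.h, and the result is what you would expect to see as CapEff for root in a default container, not output captured from one.

```sh
# Capability numbers (linux/capability.h) for the default set above:
# CHOWN=0 DAC_OVERRIDE=1 FOWNER=3 FSETID=4 KILL=5 SETGID=6 SETUID=7
# SETPCAP=8 NET_BIND_SERVICE=10 NET_RAW=13 SYS_CHROOT=18 MKNOD=27
# AUDIT_WRITE=29 SETFCAP=31
mask=0
for bit in 0 1 3 4 5 6 7 8 10 13 18 27 29 31; do
    mask=$(( mask | (1 << bit) ))   # set bit number "bit" in the mask
done
printf '%016x\n' "$mask"            # prints 00000000a80425fb
```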
Control them like this:
docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx # minimum
docker run --cap-add=NET_ADMIN ubuntu # for tc/iptables
docker run --cap-add=SYS_PTRACE ubuntu # for strace
docker run --privileged ubuntu # ALL caps + more (NOT safe)
In Kubernetes, use securityContext.capabilities:
securityContext:
  capabilities:
    drop: ["ALL"]
    add: ["NET_BIND_SERVICE"]
Debugging "Operation not permitted"
When a program fails with EPERM:
- cat /proc/<pid>/status | grep ^CapEff shows what the process has
- capsh --decode=<value> decodes it
- Compare against what the operation needs (man 7 capabilities)
- Inside a container, add it with --cap-add
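The first three steps can be scripted roughly as follows. This is a sketch only: cap_field and cap_state are hypothetical helper names, and the capability number 12 for CAP_NET_ADMIN comes from linux/capability.h. The /proc part is guarded so the snippet does nothing on systems without procfs.

```sh
# cap_field NAME: read /proc/<pid>/status-style text on stdin and print
# the hex value of one field (CapEff, CapPrm, CapBnd, ...).
cap_field() {
    awk -v name="$1:" '$1 == name { print $2 }'
}

# cap_state HEXMASK BIT: say whether capability number BIT is in the mask.
cap_state() {
    if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# Example: does the current shell hold CAP_NET_ADMIN (capability 12)?
if [ -r "/proc/$$/status" ]; then
    eff=$(cap_field CapEff < "/proc/$$/status")
    echo "CapEff=$eff CAP_NET_ADMIN: $(cap_state "$eff" 12)"
fi
```

Replace $$ with the PID of the failing process and 12 with the number of the capability the operation needs.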
CAP_SYS_ADMIN: the "new root"
For legacy reasons a great many operations require exactly SYS_ADMIN
(mount, namespaces, BPF, MAC labels). It is root in practice. A container
with CAP_SYS_ADMIN is unsafe almost by design.
This is a known problem, which is why CAP_BPF and CAP_PERFMON appeared in Linux 5.8.
They carve specific rights out of SYS_ADMIN into their own bits.