Linux capabilities: privilege bits

Why

Historically root was all-powerful and every other user had no privileged operations at all. That model is all-or-nothing: to run ping you need a raw socket, for a raw socket you need root, and a program running as root can do EVERYTHING.

Capabilities split that "omnipotence" into about 40 separate bits (41 as of Linux 5.9), each granting one specific right. A program needs only CAP_NET_RAW to open a raw ICMP socket and none of the others. This matters most in containers: --cap-drop=ALL plus --cap-add=NET_BIND_SERVICE gives you a root user that can bind low ports and do little else.

The most common capabilities

CAP	What it allows
NET_ADMIN	`ip`, `tc`, `iptables`, `nft`, network sysctls
NET_RAW	raw sockets (ping, tcpdump)
NET_BIND_SERVICE	bind to ports <1024 (80, 443, 22)
SYS_ADMIN	mount, umount, swapon, and a pile of other things, almost like root
SYS_PTRACE	`ptrace` other processes (gdb, strace for other users)
SYS_TIME	change the system clock
SYS_NICE	priority, real-time scheduling
SYS_RESOURCE	raise `ulimit` values
CHOWN	change file ownership
DAC_OVERRIDE	bypass file permission checks for read, write, and execute
DAC_READ_SEARCH	bypass read checks on files and read/search checks on directories
SETUID / SETGID	change the process UID/GID
KILL	send signals to any process
MKNOD	create device nodes
AUDIT_WRITE	write to the audit log (needed for login)
BPF	load eBPF programs (Linux 5.8+)
PERFMON	performance monitoring (`perf`, `perf_event_open`) without full root (Linux 5.8+)

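For the full list, read man 7 capabilities. capsh --print shows the sets of the current shell, and its bounding set normally names every capability the kernel supports.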

Where a process keeps its capabilities

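Each thread holds five capability sets, stored as bitmasks in the kernel's struct cred (reached from task_struct):

Permitted (P): the limit on what the thread MAY put into its effective set
Effective (E): what is ACTIVE right now; the kernel checks this set
Inheritable (I): preserved across exec, but only reaches the new program if the file's inheritable set names the same bit
Bounding (B): the ceiling. Exec cannot grant file capabilities outside this set.
Ambient (A): preserved across exec of a binary with no setuid bit and no file capabilities (Linux 4.3+, meant for non-root users)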
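
The way these sets combine at exec can be sketched with shell arithmetic. This is a simplified model of the transformation rules in man 7 capabilities, not kernel code: the function name exec_caps and its argument order are made up for illustration, and it ignores setuid/setgid bits and the special cases for UID 0.

```shell
#!/bin/sh
# exec_caps P_INH P_BND P_AMB F_INH F_PRM F_EFF
#   P_* are the calling process's sets, F_* the file's sets (hex or decimal).
#   F_EFF is the file's single "effective" flag: 1 or 0.
exec_caps() {
    p_inh=$(($1)); p_bnd=$(($2)); p_amb=$(($3))
    f_inh=$(($4)); f_prm=$(($5)); f_eff=$6

    # Simplification: a file counts as privileged if it carries any capability.
    # A privileged file clears the ambient set.
    if [ $((f_inh | f_prm)) -ne 0 ]; then new_amb=0; else new_amb=$p_amb; fi

    # P'(permitted) = (P(inh) & F(inh)) | (F(prm) & P(bnd)) | P'(amb)
    new_prm=$(( (p_inh & f_inh) | (f_prm & p_bnd) | new_amb ))

    # P'(effective) = F(eff) ? P'(permitted) : P'(ambient)
    if [ "$f_eff" -eq 1 ]; then new_eff=$new_prm; else new_eff=$new_amb; fi

    printf 'CapPrm=%016x CapEff=%016x CapAmb=%016x\n' \
        "$new_prm" "$new_eff" "$new_amb"
}

# Plain user runs a binary marked cap_net_raw+ep (bit 13 = 0x2000).
exec_caps 0 0x3fffffffff 0 0 0x2000 1
# Process with ambient NET_BIND_SERVICE (bit 10) runs a binary with no file caps.
exec_caps 0 0x3fffffffff 0x400 0 0 0
```

The first call ends up with only bit 13 in its permitted and effective sets; the second keeps bit 10 because nothing about the file is privileged.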

bash

grep ^Cap /proc/self/status                # output below is from a root shell

# CapInh:	0000000000000000

# CapPrm:	0000003fffffffff   ← bits 0-37 set = root on a pre-5.8 kernel (5.9+ shows 000001ffffffffff)

# CapEff:	0000003fffffffff

# CapBnd:	0000003fffffffff

# CapAmb:	0000000000000000

capsh --print                              # readable format

capsh --decode=0000003fffffffff

A plain user shows all zeros in CapPrm and CapEff, while CapBnd normally stays full. Root has every bit the kernel knows set. A container started with --cap-drop=ALL plus a few --cap-add flags shows a bitmask with exactly those bits set.
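
When capsh is not installed (common in minimal container images), a mask can be decoded by hand. The sketch below assumes the bit numbering from <linux/capability.h> as of Linux 5.9; decode_caps is a hypothetical helper name, not part of any package.

```shell
#!/bin/sh
# Capability names in bit order (bit 0 first), per <linux/capability.h>.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print one capability name per set bit.
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "cap_$name"
        fi
        bit=$((bit + 1))
    done
}

# 00000000a80425fb is the mask a default Docker container typically shows.
decode_caps 00000000a80425fb
```

To decode a live process, feed it the value from /proc, for example decode_caps "$(awk '/^CapEff/ {print $2}' /proc/self/status)".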

File capabilities

A binary can carry capabilities in an extended attribute (security.capability). On exec the process receives them without the binary being setuid:

bash

# Give /usr/bin/ping the CAP_NET_RAW right (how many distros have shipped it)

sudo setcap cap_net_raw+ep /usr/bin/ping

getcap /usr/bin/ping

# /usr/bin/ping = cap_net_raw+ep           (newer libcap prints: /usr/bin/ping cap_net_raw=ep)

# Remove

sudo setcap -r /usr/bin/ping

This is safer than setuid root: the process gets only the bits it needs instead of full root privileges, so a bug in the binary exposes far less.
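
File capabilities deserve the same periodic audit as setuid binaries. The sketch below filters getcap-style output for capabilities that come close to full root. Which capabilities count as "risky" is an assumption made here for illustration, risky_file_caps is a hypothetical helper, and /usr/local/bin/backup is an invented path used only as sample input.

```shell
#!/bin/sh
# risky_file_caps: read getcap-style lines on stdin and keep those that grant
# a capability from a (subjective) high-risk list.
risky_file_caps() {
    grep -E 'cap_(sys_admin|sys_module|sys_ptrace|dac_override|setuid)'
}

# Real use, assuming libcap's getcap is installed (-r recurses):
#   getcap -r /usr/bin 2>/dev/null | risky_file_caps

# Demonstration on sample lines in the older "path = caps" output format.
printf '%s\n' \
    '/usr/bin/ping = cap_net_raw+ep' \
    '/usr/local/bin/backup = cap_dac_override+ep' | risky_file_caps
```

Because the filter only matches capability names, it works with both the older and the newer getcap output formats.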

In Docker / containers

Docker gives a limited default set of caps out of the box:

CAP_AUDIT_WRITE, CAP_CHOWN, CAP_DAC_OVERRIDE, CAP_FOWNER, CAP_FSETID,
CAP_KILL, CAP_MKNOD, CAP_NET_BIND_SERVICE, CAP_NET_RAW, CAP_SETFCAP,
CAP_SETGID, CAP_SETPCAP, CAP_SETUID, CAP_SYS_CHROOT

Control them like this:

bash

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE nginx     # minimum

docker run --cap-add=NET_ADMIN ubuntu                           # for tc/iptables

docker run --cap-add=SYS_PTRACE ubuntu                          # for strace

docker run --privileged ubuntu                                  # ALL caps + more (NOT safe)

In Kubernetes, set securityContext.capabilities on each container:

yaml

securityContext:

  capabilities:

    drop: ["ALL"]

    add: ["NET_BIND_SERVICE"]

Debugging "Operation not permitted"

When a program fails with EPERM:

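grep ^CapEff /proc/<pid>/status shows the effective set the process has
capsh --decode=<value> turns that mask into names
Compare against what the operation needs (man 7 capabilities)
Inside a container, add the missing capability with --cap-add
If the capability is already there, the denial probably comes from seccomp or an LSM (AppArmor, SELinux) instead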
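
The comparison step can be scripted. This is a minimal sketch: cap_bit and check_cap are hypothetical helper names, cap_bit covers only a handful of capabilities (bit numbers from <linux/capability.h>), and the suggested fix assumes the process runs in a Docker container.

```shell
#!/bin/sh
# cap_bit NAME: bit number for a few common capabilities.
cap_bit() {
    case $1 in
        net_bind_service) echo 10 ;;
        net_admin)        echo 12 ;;
        net_raw)          echo 13 ;;
        sys_ptrace)       echo 19 ;;
        sys_admin)        echo 21 ;;
        *) return 1 ;;
    esac
}

# check_cap HEXMASK NAME: say whether NAME is set in a CapEff value.
check_cap() {
    bit=$(cap_bit "$2") || { echo "unknown capability: $2" >&2; return 2; }
    if [ $(( (0x$1 >> bit) & 1 )) -eq 1 ]; then
        echo "cap_$2 present: look at seccomp or the LSM instead"
    else
        echo "cap_$2 missing: add it with --cap-add=$(echo "$2" | tr '[:lower:]' '[:upper:]')"
    fi
}

# CapEff of a default Docker container, checked for NET_ADMIN.
check_cap 00000000a80425fb net_admin
```

For a live process, pass the value read from /proc/<pid>/status as the first argument.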

CAP_SYS_ADMIN: the "new root"

For legacy reasons a great many operations require exactly CAP_SYS_ADMIN: mount, namespace setup, security.* extended attributes (where MAC labels live), and, before Linux 5.8, BPF. It is root in practice, and a container with CAP_SYS_ADMIN is unsafe almost by design.

This is a known problem, which is why CAP_BPF and CAP_PERFMON appeared in Linux 5.8. They carve specific rights out of CAP_SYS_ADMIN into their own bits.
