Why
Capabilities split up root privileges. But even a plain user process can call about 350 different syscalls, and any of them may hold a bug that turns into a vulnerability. The best defense is to take away a process's ability to make syscalls it does not need.
A web server, for example, has no reason to call mount(), reboot(),
kexec_load(), or ptrace(). If you can turn those off, you remove a whole
piece of attack surface.
Two modes
- SECCOMP_MODE_STRICT (the old one, 2005) leaves only read, write, _exit, and sigreturn. Too rigid, nobody uses it.
- SECCOMP_MODE_FILTER (BPF, 2012) runs a BPF program that filters syscalls by number and arguments. This is what "seccomp" means today.
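Strict mode is small enough to try directly. The sketch below is only an illustration: it forks a child, switches the child to SECCOMP_MODE_STRICT with prctl(), and lets it attempt getpid(), which is not one of the four permitted calls. The function name strict_mode_demo and the STRICT_UNAVAILABLE sentinel are made up for this example. prctl() can refuse strict mode (for instance in a process that is already running under a filter), so that case is reported separately.

```c
#include <linux/seccomp.h>   /* SECCOMP_MODE_STRICT */
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STRICT_UNAVAILABLE (-2)  /* hypothetical sentinel: prctl() refused strict mode */

/*
 * Fork a child, put it into strict mode, then let it call getpid().
 * Returns the signal that terminated the child, STRICT_UNAVAILABLE if
 * strict mode could not be enabled, or -1 on any other error.
 */
int strict_mode_demo(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(42);                      /* still unconfined, so _exit() is safe */

        static const char msg[] = "write() still works in strict mode\n";
        syscall(SYS_write, STDERR_FILENO, msg, sizeof msg - 1);

        syscall(SYS_getpid);                /* not permitted: the kernel kills the child */
        syscall(SYS_exit, 0);               /* not expected to be reached */
    }

    int status;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 42)
        return STRICT_UNAVAILABLE;
    return -1;
}
```

Where strict mode is available, strict_mode_demo() should return SIGKILL. The child uses the raw exit syscall rather than the C library's _exit() after entering strict mode, because glibc's _exit() issues exit_group, which strict mode does not permit.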
The BPF program takes the syscall number and arguments as input and returns one of these:
| Action | What it does |
|---|---|
| SECCOMP_RET_ALLOW | let it through |
| SECCOMP_RET_ERRNO(n) | block it and return error n (typically EPERM) |
| SECCOMP_RET_KILL_PROCESS | kill the whole process |
| SECCOMP_RET_KILL_THREAD | kill only this thread |
| SECCOMP_RET_TRAP | send SIGSYS, which you can handle |
| SECCOMP_RET_LOG | let it through and write to audit |
| SECCOMP_RET_USER_NOTIF | hand off to userspace for a decision (newer, for containers) |
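To make the filter model concrete, here is a minimal hand-written classic BPF filter, installed with prctl(). It is a sketch under assumptions, not a recommended policy: it denies a single harmless syscall (getppid) with EPERM and allows everything else, it only knows x86-64 and AArch64, and the names install_demo_filter, getppid_errno_under_filter, and DEMO_ARCH are invented for the example. The filter is installed in a forked child so the calling process stays unfiltered.

```c
#include <errno.h>
#include <linux/audit.h>     /* AUDIT_ARCH_* */
#include <linux/filter.h>    /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>   /* struct seccomp_data, SECCOMP_RET_* */
#include <stddef.h>          /* offsetof */
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#if defined(__x86_64__)
#define DEMO_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define DEMO_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Install a filter that fails getppid() with EPERM and allows the rest. */
int install_demo_filter(void)
{
    struct sock_filter insns[] = {
        /* Load seccomp_data.arch; kill the process on an unexpected ABI. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, DEMO_ARCH, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* Load seccomp_data.nr (the syscall number) and compare it. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof insns / sizeof insns[0]),
        .filter = insns,
    };

    /* An unprivileged process must set no_new_privs before loading a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/*
 * Run the filter in a child and report what getppid() did there:
 * the errno it failed with, 0 if it succeeded, 255 if the filter could
 * not be installed, or -1 if fork/wait failed.
 */
int getppid_errno_under_filter(void)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        if (install_demo_filter() != 0)
            _exit(255);
        errno = 0;
        long rc = syscall(SYS_getppid);
        _exit(rc == -1 ? errno : 0);
    }

    int status;
    if (waitpid(child, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

On a system where the filter loads, getppid_errno_under_filter() should return EPERM. A real policy would normally invert this shape: list the allowed syscalls and make the final return a deny action.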
How a program turns seccomp on
Through the prctl() or seccomp() system call. Most programs use the
libseccomp library so they do not have to write BPF by hand:
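#include <errno.h>
#include <seccomp.h>   // libseccomp; link with -lseccomp

scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM)); // default action: fail with EPERM
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0); // what glibc's exit() actually calls
seccomp_load(ctx);     // install the filter in the kernel
seccomp_release(ctx);  // the context is not needed once the filter is loaded
// from here on, every syscall outside the allowed ones fails with EPERM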
A filter is irreversible: once loaded it cannot be removed, and any filter added later can only narrow what is allowed, never widen it.
The Docker default profile
Docker applies a seccomp profile by default. About 50 syscalls are blocked,
including clone3 (blocked in the past as a CVE mitigation), kexec_load,
keyctl, reboot, mount (which also needs CAP_SYS_ADMIN), and some ptrace
variants.
docker run --security-opt seccomp=unconfined ubuntu # disable seccomp, for debugging only
docker run --security-opt seccomp=/path/profile.json ubuntu # your own profile
In Kubernetes:
securityContext:
  seccompProfile:
    type: RuntimeDefault   # the container runtime's default (Docker-style) profile
    # or, for a custom profile:
    # type: Localhost
    # localhostProfile: profiles/audit.json
How it relates to other mechanisms
- AppArmor / SELinux enforce mandatory access control (MAC) on files, paths, and network access
- capabilities define which root privileges are allowed
- seccomp defines which syscalls are allowed
These are separate layers. In production you should enable all three plus namespace isolation. That is the "defense in depth" approach.
Debugging seccomp violations
When a process fails with EPERM "for no obvious reason", suspect seccomp:
# 1. Check the process's seccomp mode
grep ^Seccomp /proc/<pid>/status
# Seccomp: 2 ← 0=disabled, 1=strict, 2=filter
# 2. strace, to see the blocked syscall
strace -p <pid> # look for a specific syscall failing with EPERM (Operation not permitted)
# 3. dmesg, if the profile is set to LOG
sudo dmesg | grep audit
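The first check can also be done from code, for example inside a health check or a debugging helper. This is a small sketch that reads the same Seccomp field from /proc/<pid>/status. The function names parse_seccomp_line, seccomp_mode_of, and seccomp_mode_name are invented for the example.

```c
#include <stdio.h>
#include <sys/types.h>

/* Parse one line of /proc/<pid>/status. Returns the seccomp mode
 * (0, 1, or 2) or -1 if the line is not the "Seccomp:" field. */
int parse_seccomp_line(const char *line)
{
    int mode;
    if (sscanf(line, "Seccomp: %d", &mode) == 1)
        return mode;
    return -1;
}

/* Return the seccomp mode of `pid`, or -1 if it cannot be determined. */
int seccomp_mode_of(pid_t pid)
{
    char path[64];
    char line[256];
    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int mode = -1;
    while (mode < 0 && fgets(line, sizeof line, f) != NULL)
        mode = parse_seccomp_line(line);

    fclose(f);
    return mode;
}

/* Human-readable name for the value in the Seccomp field. */
const char *seccomp_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "disabled";
    case 1:  return "strict";
    case 2:  return "filter";
    default: return "unknown";
    }
}
```

Calling seccomp_mode_name(seccomp_mode_of(getpid())) (with <unistd.h> included) reports the mode of the current process. Inside a container with a default profile this would be expected to say "filter".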
To build your own Docker profile, run the application under a profile whose
default action is SCMP_ACT_LOG (SECCOMP_RET_LOG in the kernel: let everything
through but log it), collect the list of syscalls it actually uses, then
build a minimal allowlist from that list.
System tools
- scmp_sys_resolver <number> turns a syscall number into its name
- seccomp-tools (third party) dumps the BPF from a running process
- falco gives observability and audit with seccomp