seccomp: a system call filter

Why

Capabilities split up root privileges. But even an unprivileged process can call about 350 different syscalls, and a bug in any of them can become a vulnerability. The best defense is to take away a process's ability to make the syscalls it does not need.

A web server, for example, has no reason to call mount(), reboot(), kexec_load(), or ptrace(). If you can turn those off, you cut off a whole attack surface.

Two modes

SECCOMP_MODE_STRICT (the old one, 2005) leaves only read, write, _exit, and sigreturn. Too rigid for almost any real program, so it is rarely used.
SECCOMP_MODE_FILTER (BPF, 2012) runs a BPF program that filters syscalls by number and arguments. This is what "seccomp" means today.

The BPF program takes the syscall number and arguments as input and returns one of these:

Action	What it does
`SECCOMP_RET_ALLOW`	let it through
`SECCOMP_RET_ERRNO(n)`	block it and return error `n` (typically EPERM)
`SECCOMP_RET_KILL_PROCESS`	kill the whole process
`SECCOMP_RET_KILL_THREAD`	kill only this thread
`SECCOMP_RET_TRAP`	block it and send SIGSYS, which the process can handle
`SECCOMP_RET_LOG`	let it through and write an audit log entry
`SECCOMP_RET_USER_NOTIF`	hand off to userspace for a decision (newer, for containers)

How a program turns seccomp on

Through the prctl() or seccomp() system call. Most programs use the libseccomp library so they do not have to write BPF by hand:

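// needs <seccomp.h> and <errno.h>; link with -lseccomp
scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));   // default action: block with EPERM
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read),  0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit),  0);
seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);   // glibc's exit() uses exit_group
seccomp_load(ctx);      // returns 0 on success; check it in real code
seccomp_release(ctx);   // frees the context, the loaded filter stays active
// from here on, any syscall not allowed above fails with EPERM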

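A filter is irreversible: once loaded it cannot be removed, and any filter added later can only narrow what is allowed, never widen it. Child processes inherit it across fork() and execve().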

The Docker default profile

Docker applies a seccomp profile by default. It blocks about 50 syscalls, including clone3 (blocked in the past as a CVE mitigation), kexec_load, keyctl, reboot, mount (which also needs CAP_SYS_ADMIN), and some ptrace variants.

bash

docker run --security-opt seccomp=unconfined ubuntu     # turn seccomp off, for debugging only
docker run --security-opt seccomp=/path/profile.json     # your own profile

In Kubernetes:

yaml

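securityContext:
  seccompProfile:
    type: RuntimeDefault                         # the container runtime's default profile
    # or, for your own profile stored on the node:
    # type: Localhost
    # localhostProfile: profiles/audit.json      # path relative to the kubelet's seccomp directory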

How it relates to other mechanisms

AppArmor / SELinux filter at the MAC level (file/path/network)
capabilities define what root privileges are allowed
seccomp defines which syscalls are allowed

These are separate layers. In production you should enable all three plus namespace isolation. That is the "defense in depth" approach.

Debugging seccomp violations

When a syscall fails with EPERM "for no obvious reason", or a process is killed with SIGSYS, suspect seccomp:

bash

# 1. Check the process's seccomp mode
grep ^Seccomp /proc/<pid>/status
# Seccomp:    2     ← 0=disabled, 1=strict, 2=filter

# 2. strace, to see which syscall is blocked
strace -p <pid>          # look for a syscall failing with EPERM (Operation not permitted)

# 3. dmesg, if the profile is set to LOG
sudo dmesg | grep audit

To build your own Docker profile, run the application with SECCOMP_RET_LOG as the default action (let everything through but log it), collect the list of syscalls it actually uses, then build a minimal allowlist from that list.

System tools

scmp_sys_resolver <number> turns a syscall number into its name
seccomp-tools (third party) dumps the BPF from a running process
falco (third party) monitors syscalls at runtime for observability and audit, alongside seccomp

