linuxlab.io
Copyright © 2026 LinuxLab. All rights reserved.

kb/processes ── Processes & resources ── advanced

seccomp: a system call filter

seccomp is a kernel-level syscall filter. A process declares "only these are allowed", and the kernel cuts off the rest. It underpins the Docker and Chrome sandboxes.

aka: seccomp-bpf, syscall-filter, secure-computing

Why

Capabilities split up root privileges. But even a plain user process can call about 350 different syscalls, and any of them may hold a bug that turns into a vulnerability. The best defense is to take away a process's ability to make syscalls it does not need.

A web server, for example, has no reason to call mount(), reboot(), kexec_load(), or ptrace(). If you can turn those off, you cut off a whole attack surface.

Two modes

  • SECCOMP_MODE_STRICT (the old one, 2005) leaves only read, write, _exit, and sigreturn. It is too rigid for almost any real program, so it is rarely used.
  • SECCOMP_MODE_FILTER (BPF, 2012) runs a classic BPF program that filters syscalls by number and arguments. This is what "seccomp" means today.
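
To make filter mode concrete, here is a minimal sketch of a hand-written filter installed with prctl(). The helper name errno_seen_when_blocked is hypothetical. The filter matches on the syscall number only, and it skips the architecture check that a real filter should perform first. It runs in a forked child so the calling process stays unfiltered.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: in a forked child, install a filter that fails
 * syscall `nr` with EPERM, call it with no arguments, and return the errno
 * the child saw (0 if the call succeeded, -1 if setup failed). */
int errno_seen_when_blocked(int nr)
{
    struct sock_filter code[] = {
        /* A = seccomp_data.nr (the syscall number) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if (A == nr) run the next instruction, else skip one */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = { .len = sizeof code / sizeof code[0], .filter = code };

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* An unprivileged process must set no_new_privs before filtering. */
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) _exit(255);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) _exit(255);
        errno = 0;
        long rc = syscall(nr);
        _exit(rc == -1 ? errno : 0);   /* exit_group is still allowed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    int seen = WEXITSTATUS(status);
    return seen == 255 ? -1 : seen;
}
```

Calling errno_seen_when_blocked(SYS_getppid) is a convenient check, because getppid() never fails on its own, so an EPERM result can only come from the filter.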

The BPF program takes the syscall number, the architecture, and the raw argument values as input (it cannot dereference pointers) and returns one of these:

Action                     What it does
SECCOMP_RET_ALLOW          let it through
SECCOMP_RET_ERRNO(n)       block it and return error n (typically EPERM)
SECCOMP_RET_KILL_PROCESS   kill the whole process
SECCOMP_RET_KILL_THREAD    kill only this thread
SECCOMP_RET_TRAP           SIGSYS, which you can handle
SECCOMP_RET_LOG            let it through and write to audit
SECCOMP_RET_USER_NOTIF     hand off to userspace for a decision (newer, for containers)
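
As an illustration of one of these actions, the sketch below uses SECCOMP_RET_TRAP and a SIGSYS handler that reads si_syscall to learn which syscall was trapped. The names on_sigsys and trap_reports_getppid are hypothetical, the architecture check is again left out for brevity, and the experiment runs in a forked child.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* SIGSYS handler: si_syscall carries the number of the trapped syscall. */
static void on_sigsys(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    _exit(info->si_syscall == SYS_getppid ? 1 : 2);
}

/* Hypothetical helper. Returns 1 if the child's handler received SIGSYS
 * for getppid, 0 if the syscall went through, -1 on setup failure. */
int trap_reports_getppid(void)
{
    struct sock_filter code[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = { .len = sizeof code / sizeof code[0], .filter = code };

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_sigsys;
        sa.sa_flags = SA_SIGINFO;
        if (sigaction(SIGSYS, &sa, NULL) != 0) _exit(2);
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) _exit(2);
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) _exit(2);
        syscall(SYS_getppid);   /* trapped: the handler runs instead */
        _exit(0);               /* reached only if nothing was trapped */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    int code_seen = WEXITSTATUS(status);
    return code_seen == 1 ? 1 : (code_seen == 0 ? 0 : -1);
}
```

A handler like this is mostly useful during development: it names the offending syscall at the moment it happens, instead of leaving you with a bare error code.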

How a program turns seccomp on

A program installs a filter through the prctl() or seccomp() system call. Most programs use the libseccomp library so they do not have to write BPF by hand:

c
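#include <errno.h>
#include <seccomp.h>                                             // libseccomp, link with -lseccomp
int main(void) {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EPERM));   // default = block
    if (!ctx) return 1;
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read),  0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit),  0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);  // what glibc's exit() actually calls
    if (seccomp_load(ctx) != 0) return 1;                        // install the filter in the kernel
    // now any syscall except the allowed ones returns EPERM
}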

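A filter is irreversible: you can only narrow it, never widen it. Filters are inherited across fork() and execve(), and an unprivileged process must set no_new_privs before it can install one.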
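
A small sketch of what "only narrow" means in practice: when several filters are stacked, the kernel runs all of them and the most restrictive result wins, so a later permissive filter cannot re-enable a blocked syscall. The helper names load_filter and still_blocked_after_second_filter are hypothetical. This version uses the seccomp() system call through syscall(2) and assumes kernel headers that define SYS_seccomp.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: load a filter that returns `action` for syscall `nr`
 * and SECCOMP_RET_ALLOW for everything else. Returns 0 on success. */
static int load_filter(int nr, unsigned int action)
{
    struct sock_filter code[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, action),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = { .len = 4, .filter = code };
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}

/* Returns 1 if getppid is still blocked after a second, permissive filter
 * is stacked on top, 0 if it was re-enabled, -1 on setup failure. */
int still_blocked_after_second_filter(void)
{
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0) _exit(2);
        /* First filter: getppid fails with EPERM. */
        if (load_filter(SYS_getppid, SECCOMP_RET_ERRNO | EPERM) != 0) _exit(2);
        /* Second filter: tries to "allow" getppid again. */
        if (load_filter(SYS_getppid, SECCOMP_RET_ALLOW) != 0) _exit(2);
        errno = 0;
        long rc = syscall(SYS_getppid);
        _exit(rc == -1 && errno == EPERM ? 1 : 0);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) return -1;
    int code_seen = WEXITSTATUS(status);
    return code_seen == 2 ? -1 : code_seen;
}
```

The second load succeeds, but it only adds another check on top of the first one; nothing in the API removes a filter that is already installed.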

The Docker default profile

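Docker applies a seccomp profile by default. The profile is an allowlist, and the syscalls it leaves out (roughly 50) fail with an error. Blocked calls include kexec_load, keyctl, reboot, mount (which also needs CAP_SYS_ADMIN), some ptrace variants, and clone3 (seccomp cannot inspect the flags inside its argument struct, so the call is rejected and libc falls back to clone).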

bash
docker run --security-opt seccomp=unconfined ubuntu     # TURN OFF, for debugging
docker run --security-opt seccomp=/path/profile.json     # your own profile

In Kubernetes:

yaml
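securityContext:
  seccompProfile:
    type: RuntimeDefault                         # the container runtime's default (Docker-style) profile
    # or, for a custom profile stored on the node:
    # type: Localhost
    # localhostProfile: profiles/audit.json      # relative to the kubelet's seccomp directory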

How it relates to other mechanisms

  • AppArmor / SELinux filter at the MAC level (file/path/network)
  • capabilities define what root privileges are allowed
  • seccomp defines which syscalls are allowed

These are separate layers. In production you should enable all three plus namespace isolation. That is the "defense in depth" approach.

Debugging seccomp violations

When a syscall fails with EPERM "for no obvious reason", or a process is killed with SIGSYS, suspect seccomp:

bash
# 1. Look at the process filters
cat /proc/<pid>/status | grep ^Seccomp
# Seccomp:    2     ← 0=disabled, 1=strict, 2=filter
# 2. strace, to see the blocked syscall
strace -f -p <pid>       # look for a syscall that fails with EPERM (Operation not permitted)
# 3. dmesg, if the profile is set to LOG
sudo dmesg | grep audit
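
The first check can also be done from inside a program. This is a minimal sketch that parses the Seccomp field of /proc/<pid>/status; the helper name seccomp_mode_of is hypothetical.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the Seccomp field from /proc/<pid>/status
 * (0 = disabled, 1 = strict, 2 = filter), or -1 if it cannot be read. */
int seccomp_mode_of(pid_t pid)
{
    char path[64], line[256];
    int mode = -1;

    snprintf(path, sizeof path, "/proc/%ld/status", (long)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    while (fgets(line, sizeof line, f)) {
        /* Matches "Seccomp:\t2" but not "Seccomp_filters:\t1". */
        if (sscanf(line, "Seccomp: %d", &mode) == 1)
            break;
    }
    fclose(f);
    return mode;
}
```

For example, seccomp_mode_of(getpid()) reports the mode of the current process; inside a container started with the default Docker profile it would be expected to return 2.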

To build your own Docker profile, run the application with a profile whose default action is SCMP_ACT_LOG (let everything through but log it), collect the list of syscalls it actually uses, then build a minimal allowlist from that list.

System tools

  • scmp_sys_resolver <number> turns a syscall number into its name
  • seccomp-tools (third party) dumps the BPF from a running process
  • falco provides runtime syscall observability and auditing that complements seccomp

§ commands

bash
cat /proc/<pid>/status | grep ^Seccomp

The seccomp state of a process (0/1/2)

bash
docker run --security-opt seccomp=unconfined ubuntu

Turn seccomp off for debugging, only locally, never in production

bash
docker run --security-opt seccomp=profile.json myimg

Apply a custom profile from JSON

bash
scmp_sys_resolver -a x86_64 41

Turn a syscall number into its name (41 = socket)

bash
strace -f -e trace=seccomp,prctl ./app

See in strace when a program applies its own seccomp filter

§ see also

  • process-and-pid: Process and PID. A process is a running program with its own PID, memory, open descriptors, and UID. All processes form a tree rooted at init (PID 1).
  • capabilities: Linux capabilities: privilege bits. Capabilities split root's power into 40+ independent bits: NET_ADMIN, SYS_PTRACE, and so on. You can grant a process a slice of that power without making it full root.
  • namespaces: Linux namespaces. Namespaces are a kernel mechanism that gives a process its own isolated view of a resource (network, mount points, PID, UID, IPC, hostname, time). Every container is built on them.