linuxlab.io
Copyright © 2026 LinuxLab. All rights reserved.

kb/containers ── Containers (bonus) ── advanced

runc, runsc, kata: container runtimes

runc is the standard OCI runtime: namespaces+cgroups+seccomp. runsc/gVisor is a userspace kernel for extra isolation. kata is a lightweight VM per container. Performance and isolation trade off against each other.

aka: runc, runsc, kata, gvisor, container-runtime, low-level-runtime

What an OCI runtime is

It is the subsystem that takes an OCI bundle ([[oci-spec|spec]]: config.json + rootfs/) and starts the container. Exactly how it does that is its own choice; what matters is that it conforms to the OCI runtime spec.

Three popular approaches in 2026, plus two runc-compatible alternatives (crun, youki):

| Runtime | Approach | Trade-off |
| --- | --- | --- |
| runc | namespaces + cgroups + seccomp in the host kernel | maximum performance, weakest isolation of the three |
| runsc (gVisor) | userspace kernel intercepts syscalls | roughly 10-50% slower depending on the workload, much stronger isolation |
| kata-containers | each container in a lightweight VM | ~5% overhead, VM-grade isolation |
| crun | a runc alternative written in C, faster startup | same isolation as runc |
| youki | runc-compatible, written in Rust | same isolation as runc |

runc, the reference

Originally written at Docker and donated to the OCI as a minimal reference implementation. The code is open and packaged by all major distros. It sits under all the common container stacks (Docker, containerd, CRI-O, podman), either as runc itself or as its replacement (crun).

What runc does on runc run myctr (simplified):

  1. Reads config.json
  2. Creates the [[namespaces|namespaces]] listed in the config (typically PID, NET, MNT, IPC, UTS; USER for rootless containers)
  3. Sets up [[cgroups|cgroups]] (memory, cpu)
  4. Drops every capability that is not listed in the config
  5. Applies a seccomp profile
  6. Applies an AppArmor/SELinux profile if one is set
  7. Switches the root to rootfs with pivot_root (chroot only when --no-pivot is passed)
  8. exec the command specified in the config

All of this happens in the host kernel. The container shares the host kernel: the same VFS, the same scheduler. Isolation comes from namespaces, while cgroups, capabilities, and seccomp limit what the process can consume and call.
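One way to see that a container is an ordinary host process placed in different namespaces is to compare namespace IDs under /proc. The sketch below is illustrative only: ns_ids is a hypothetical helper name, and it assumes a Linux host with /proc mounted.

```bash
# ns_ids [PID]: print "<namespace> <id>" for each namespace of a process.
# Hypothetical helper; it reads the /proc/<pid>/ns symlinks, whose targets
# look like "pid:[4026531836]". Defaults to the calling process.
ns_ids() {
  local pid="${1:-self}" link target id
  for link in /proc/"$pid"/ns/*; do
    target=$(readlink "$link") || continue
    id=${target#*\[}   # strip everything up to and including "["
    id=${id%\]}        # strip the trailing "]"
    printf '%s %s\n' "${target%%:*}" "$id"
  done
}
```

Running ns_ids for the host shell and again for a container's init PID (the pid field printed by sudo runc state mycontainer-id; reading another user's /proc entries needs root) should show different IDs for every unshared namespace, while uname -r reports the same kernel release on both sides.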

Running it directly without Docker:

bash
# Prepare the bundle
mkdir -p mycontainer/rootfs
cd mycontainer
docker export $(docker create alpine) | tar -C rootfs -xf -
runc spec                                  # creates config.json
# edit config.json to suit your needs
# Run
sudo runc run mycontainer-id
# Management (run as the same user as `runc run`; container state is kept per user)
sudo runc list
sudo runc kill mycontainer-id KILL
sudo runc delete mycontainer-id

This is the layer "below Docker". You use it when you want to understand what exactly happens, or for embedded scenarios.
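The config.json written by runc spec starts an interactive sh with a terminal attached. The create/start lifecycle (runc create, then runc start) is easier with a non-interactive command. A minimal sketch, assuming an unmodified file exactly as runc spec wrote it; make_noninteractive is a hypothetical helper and sleep 5 is only a placeholder command:

```bash
# make_noninteractive [CONFIG]: patch a default `runc spec` config.json in place.
# Assumes the stock layout: a `"terminal": true` entry and an args array whose
# only element, "sh", sits on its own line. A backup is kept as CONFIG.bak.
make_noninteractive() {
  local config="${1:-config.json}"
  sed -i.bak \
    -e 's/"terminal":[[:space:]]*true/"terminal": false/' \
    -e 's/^\([[:space:]]*\)"sh"[[:space:]]*$/\1"sleep", "5"/' \
    "$config"
}
```

After the patch, sudo runc create mycontainer-id followed by sudo runc start mycontainer-id runs the placeholder command, and sudo runc delete mycontainer-id cleans up once it has exited. For anything beyond this one-off edit, a JSON-aware tool is a safer choice than sed.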

runc, where it sits in the Docker stack

docker (dockerd) or kubelet
      │
      ▼
containerd
      │
      ▼
containerd-shim (one per container, survives a containerd restart)
      │
      ▼
runc (starts the init process, then exits)
      │
      ▼
the container's init process (PID 1 in the pid namespace)

The shim keeps the container running across a restart of the higher-level managers. runc is short-lived: it does its job and exits. podman and CRI-O do not go through containerd; there conmon plays the shim role in front of runc (or crun).

crun, the C alternative

Same contract as runc, but:

  • Written in C (runc is Go), so startup is faster
  • Smaller memory footprint
  • Default OCI runtime for podman on Fedora and RHEL 9+ (RHEL 8 still defaults to runc)

It is meant as a drop-in replacement: containerd, CRI-O, or podman can be pointed at crun instead of runc without changing images or bundles.

Use it when you start many short-lived containers (CI, k8s jobs, function-as-a-service).
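Whether the faster startup matters for a given host can be checked by timing a batch of short-lived containers under each runtime. A rough sketch, assuming Docker is the front end and that crun has been registered in /etc/docker/daemon.json under the runtime name crun (an assumption, not something the stock install does); time_runtime is a hypothetical helper and measures whole seconds only:

```bash
# time_runtime RUNTIME [COUNT]: start COUNT containers that exit immediately
# and report the elapsed wall-clock time in whole seconds.
time_runtime() {
  local runtime="$1" count="${2:-10}" i=0 start end
  start=$(date +%s)
  while [ "$i" -lt "$count" ]; do
    # `true` exits at once, so the loop measures create/start/teardown cost
    docker run --rm --runtime="$runtime" alpine true >/dev/null || return 1
    i=$((i + 1))
  done
  end=$(date +%s)
  echo "$runtime: $count containers in $((end - start))s"
}
```

Calling time_runtime runc 20 and then time_runtime crun 20 on the same host gives a first impression; pull the image beforehand so the first run is not dominated by the download.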

runsc / gVisor, a userspace kernel

The concept: place a userspace kernel (gVisor's "Sentry") between the application and the host kernel. The Sentry intercepts the application's syscalls and implements most of them itself.

app (inside the container)
      │ syscall
      ▼
Sentry (gVisor userspace kernel)
      │ a limited subset of host syscalls
      ▼
host kernel

Pros:

  • Not tied to the host kernel for most syscalls, so exploiting a kernel CVE is harder
  • Smaller attack surface: ~50 host syscalls instead of ~400
  • No VM, so startup is fast (a fraction of a second)

Cons:

  • Performance hit, 10-50% depending on the workload (syscall- and I/O-heavy workloads pay the most)
  • Not all syscalls are implemented: edge networking/file features may be missing or partial (io_uring, for example, is only partially supported)
  • Not every workload fits: a database that relies on io_uring or AIO will suffer

Running it:

bash
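# Installation
curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor ...
sudo apt install runsc
# Register it with Docker: /etc/docker/daemon.json must contain the runtime entry
# (merge it into any existing settings rather than overwriting the file)
cat /etc/docker/daemon.json
# {
#   "runtimes": {
#     "runsc": { "path": "/usr/bin/runsc" }
#   }
# }
sudo systemctl restart docker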
# Use it
docker run --runtime=runsc -it alpine
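With the runtime registered, the effect of the Sentry can be observed from inside a container: under runc the container reports the host's kernel release, under runsc it reports whatever gVisor presents. A small sketch; both function names are hypothetical, and it assumes the alpine image and a runtime registered under the name passed in:

```bash
# kernel_in_container RUNTIME: kernel release as seen from inside a container.
kernel_in_container() {
  docker run --rm --runtime="$1" alpine uname -r
}

# same_kernel_as_host RUNTIME: succeeds if the container sees the host kernel.
same_kernel_as_host() {
  [ "$(kernel_in_container "$1")" = "$(uname -r)" ]
}
```

same_kernel_as_host runc is expected to succeed and same_kernel_as_host runsc to fail. docker run --rm --runtime=runsc alpine dmesg is another quick check, since it prints gVisor's own boot log rather than the host's.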

Where it is used:

  • Google App Engine / Cloud Run, internally
  • Untrusted code execution (online code playgrounds)
  • Multi-tenant CI, where a shared cluster runs other people's code

kata-containers, VM-based

Each container runs in a lightweight VM (via qemu/cloud-hypervisor/firecracker). Pros:

  • Hardware-grade isolation, a VM boundary, not a namespace boundary
  • Compatibility close to 100%, there is a real Linux kernel inside the VM
  • Support for GPU passthrough and custom kernels

Cons:

  • Overhead in RAM (~50-200 MB per container for the VM)
  • Slower startup, 1-2 sec instead of < 100 ms
  • Needs KVM on the node, so on cloud VMs it requires nested virtualization, which some providers and instance types do not allow

yaml
# k8s with CRI-O: a RuntimeClass plus a pod that uses it
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
apiVersion: v1
kind: Pod
metadata:
  name: secure-pod
spec:
  runtimeClassName: kata
  containers:
  - name: app
    image: myapp:v1

Used in:

  • AWS Lambda + Firecracker: not Kata itself, but the same idea
  • Kata on AKS / Azure Container Instances
  • Confidential containers (CoCo): Kata + AMD SEV / Intel TDX for encrypted-memory protection

Comparison

| Property | runc | runsc / gVisor | kata-containers |
| --- | --- | --- | --- |
| Isolation | namespaces | userspace kernel | VM |
| Performance | 100% (baseline) | ~50-90% | ~95% |
| Memory overhead | a few MB | ~30 MB per Sentry | ~50-200 MB per VM |
| Startup | ~100 ms | ~150 ms | ~1-2 sec |
| Compatibility | 100% | ~85% | ~99% |
| Use case | default everywhere | untrusted code | secure multi-tenancy |
| Where it is the default | Docker, containerd, CRI-O, k8s | Google Cloud Run | Confidential Containers (CoCo) |
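The memory row is the one that scales with container count, so it is worth multiplying out before picking a runtime for a dense node. A back-of-the-envelope sketch using the approximate per-sandbox figures from this article (~30 MB per Sentry, ~50-200 MB per Kata VM); fleet_overhead is a hypothetical helper and the numbers are estimates, not measurements:

```bash
# fleet_overhead COUNT: estimated extra RAM in MB for COUNT sandboxes,
# using the rough per-sandbox figures quoted in the comparison.
fleet_overhead() {
  local n="$1"
  echo "gVisor: ~$((n * 30)) MB"
  echo "Kata: $((n * 50))-$((n * 200)) MB"
}
```

For 200 sandboxes, fleet_overhead 200 prints about 6000 MB for gVisor and 10000-40000 MB for Kata, on top of what the workloads themselves use.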

RuntimeClass in k8s

k8s allows multiple runtimes side-by-side:

yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
apiVersion: v1
kind: Pod
spec:
  runtimeClassName: gvisor             # this pod runs through gVisor
  containers: [...]

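If runtimeClassName is not set, the pod runs with the node's default runtime (usually runc). A RuntimeClass can also carry scheduling constraints (a nodeSelector and tolerations), so pods for the untrusted runtime land only on nodes that have the handler installed.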
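To confirm which RuntimeClass a pod actually requested, the field can be read straight from the pod spec. A minimal sketch; pod_runtime_class is a hypothetical helper, and the "<node default>" placeholder is just this sketch's way of printing an unset field:

```bash
# pod_runtime_class POD [NAMESPACE]: print the pod's runtimeClassName,
# or "<node default>" when the field is not set.
pod_runtime_class() {
  local rc
  rc=$(kubectl get pod "$1" -n "${2:-default}" \
        -o jsonpath='{.spec.runtimeClassName}') || return 1
  echo "${rc:-<node default>}"
}
```

kubectl get runtimeclass lists the classes defined in the cluster together with their handlers, which helps when a pod stays Pending because its handler is not configured on any node.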

When things go wrong

  • exec format error: with a multi-arch image, the runtime starts a binary built for the wrong architecture. Pull the correct platform.
  • OCI runtime exec failed: exec failed: the entrypoint does not exist or is not executable in the rootfs. chmod +x or check the path.
  • A runsc workload fails with an unsupported syscall: runsc --strace (together with its debug log) shows which one. The platform (for example --platform=ptrace) changes how syscalls are intercepted and what the host must provide, so switching it helps only when the platform itself is the problem.
  • Kata starts slowly: usually a cold start of the hypervisor. Set enable_template = true in configuration.toml for a prebooted VM.
  • runc update does not change cgroup limits: cgroup v1 and v2 use different paths and resource names. Modern runc handles both, but the caller (containerd, for example) may still pass resources in the old format.
  • Unknown runtime in Docker: it is not registered in /etc/docker/daemon.json, or systemctl restart docker was not run.
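The first and last items in this list can be checked mechanically. A sketch under stated assumptions: the function names are hypothetical, Docker is the front end, and the uname-to-image architecture mapping covers only the common cases:

```bash
# host_arch: map `uname -m` to the architecture name used in image metadata.
host_arch() {
  case "$(uname -m)" in
    x86_64) echo amd64 ;;
    aarch64 | arm64) echo arm64 ;;
    armv7l) echo arm ;;
    *) uname -m ;;   # unmapped: fall back to the raw machine name
  esac
}

# image_arch_matches IMAGE: succeeds if a local image targets the host arch.
image_arch_matches() {
  [ "$(docker image inspect --format '{{.Architecture}}' "$1")" = "$(host_arch)" ]
}

# runtime_registered NAME: succeeds if Docker lists NAME among its runtimes.
runtime_registered() {
  docker info --format '{{json .Runtimes}}' | grep -q "\"$1\""
}
```

A failing image_arch_matches myapp:v1 points at the exec format error case; a failing runtime_registered runsc means daemon.json lacks the entry or the daemon was not restarted after the change.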

Alternatives and related

  • firecracker: a VMM, not a runtime, but Kata can use it
  • bubblewrap (bwrap): the sandboxing tool behind Flatpak, in a role similar to runc; not OCI-compatible
  • lxc/lxd: older, not OCI; more "system containers" than "application containers"
  • systemd-nspawn: containerization built into systemd; also not OCI

§ commands

bash
runc spec

Generate a default config.json for the bundle, the starting point

bash
sudo runc run mycontainer

Run from the current directory (where config.json + rootfs/ live)

bash
runc list

Lists the containers runc knows about on the host (run it as the same user that started them); a debug tool

bash
docker run --runtime=runsc -it alpine

Run through gVisor, higher isolation but not all syscalls work

bash
kubectl describe pod mypod | grep -i runtime

Which RuntimeClass a pod uses in k8s

bash
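sudo ctr run --rm -t --runtime io.containerd.runsc.v1 docker.io/library/alpine:latest alpine sh

Run directly in containerd with an explicit runtime (ctr needs the full image reference, pulled beforehand with ctr image pull)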

bash
runsc --platform=ptrace run mycontainer

Force the ptrace platform: a slower fallback for hosts where KVM is unavailable

§ see also

  • OCI spec: the container standard (oci-spec). OCI is three specs: Image (layers + manifest), Runtime (config.json + rootfs for runc), Distribution (registry API). The standard that followed Docker; runc, podman, containerd, CRI-O are all OCI-compatible.
  • Linux namespaces (namespaces). Namespaces are a kernel mechanism that gives a process its own isolated view of a resource (network, mount points, PID, UID, IPC, hostname, time). Every container is built on them.
  • cgroups (v2) (cgroups). cgroups v2 is a hierarchical virtual FS under `/sys/fs/cgroup` that the kernel uses to limit CPU, memory, and I/O for processes. Docker, k8s, and systemd write here.
  • seccomp: a system call filter (seccomp). seccomp is a kernel-level syscall filter. A process declares "only these are allowed", and the kernel cuts off the rest. It anchors the Docker and Chrome sandboxes.
  • Kubernetes pod lifecycle: from Pending to Terminated (kubernetes-pod-lifecycle). A pod moves through the phases Pending, Running, Succeeded/Failed/Unknown. Init containers run sequentially before the main ones. Probes: startup, then readiness/liveness. SIGTERM plus a grace period on delete.
  • Docker storage drivers: overlay2, btrfs, zfs (docker-storage-drivers). A storage driver is how Docker keeps image layers and container changes on disk. overlay2 is the default (overlayfs over ext4/xfs), btrfs and zfs work through subvolumes and snapshots, fuse-overlayfs is for rootless.
  • kubelet: the Kubernetes node agent architecture (kubelet-internals). kubelet is a daemon on every node. It receives the PodSpec through the API, starts containers through CRI, mounts volumes through CSI, and watches health. Under pressure it evicts pods. Image GC and the cgroup tree are also its job.