Why namespaces exist
Containers are not virtual machines. They share one kernel with the host. Isolation comes from two mechanisms working together:
- cgroups limit how much resource a process may use
- namespaces limit what a process can see
The eight types
| namespace | what it isolates | unshare/clone flag |
|---|---|---|
| mnt | mount points (its own set of mounted filesystems) | CLONE_NEWNS |
| net | interfaces, routes, ARP, sockets, firewall | CLONE_NEWNET |
| pid | the PID tree; PID 1 in the ns is not PID 1 on the host | CLONE_NEWPID |
| user | UID/GID mappings; root inside can map to an unprivileged user outside | CLONE_NEWUSER |
| uts | hostname, domainname | CLONE_NEWUTS |
| ipc | System V IPC (shared memory, semaphores, message queues), POSIX message queues | CLONE_NEWIPC |
| cgroup | view of the cgroup tree (you see only your own subtree) | CLONE_NEWCGROUP |
| time | CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets (Linux 5.6+) | CLONE_NEWTIME |
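Each flag in the table is a single bit, so several namespaces can be requested in one call by OR-ing the flags together. A minimal Python sketch of that, assuming the numeric values from the kernel header linux/sched.h; the dictionary `CLONE_FLAGS` and the helper `flags_for` are illustrative names, not part of any library:

```python
# Numeric values as defined in the kernel header linux/sched.h.
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
    "time": 0x00000080,    # CLONE_NEWTIME
}


def flags_for(names):
    """OR together the clone/unshare flags for the given namespace names."""
    mask = 0
    for name in names:
        mask |= CLONE_FLAGS[name]  # KeyError on an unknown namespace name
    return mask


print(hex(flags_for(["net", "pid"])))  # 0x60000000
```

The resulting integer is what a program would pass as the flags argument of clone() or unshare().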
How they get created
Three ways:
- clone() / unshare() syscall: a program asks the kernel for a new namespace
- ip netns add NAME: creates a network namespace and mounts /run/netns/NAME so you can refer to it later (see veth-pair)
- unshare CMD: a wrapper that does unshare() plus exec()
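To make the first route concrete, here is a hedged sketch of a program asking the kernel for a new namespace by calling unshare() through ctypes. The wrapper `unshare` and the function `demo` are illustrative names; the flag value is taken from linux/sched.h. It assumes Linux, and `demo` will typically fail with a permission error unless run as root:

```python
import ctypes
import errno
import os

CLONE_NEWUTS = 0x04000000  # value from linux/sched.h

# CDLL(None) exposes the symbols of the C library already loaded in this process.
_libc = ctypes.CDLL(None, use_errno=True)


def unshare(flags):
    """Call unshare(2); raise OSError if the call fails or is unavailable."""
    func = getattr(_libc, "unshare", None)
    if func is None:  # the C library has no unshare(), e.g. not Linux
        raise OSError(errno.ENOSYS, "unshare() is not available")
    if func(ctypes.c_int(flags)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def demo():
    """Move this process into a fresh UTS namespace (usually needs root)."""
    before = os.readlink("/proc/self/ns/uts")
    unshare(CLONE_NEWUTS)
    after = os.readlink("/proc/self/ns/uts")
    print(before, "->", after)  # the two ids differ after a successful call
```

The unshare command-line wrapper does essentially this and then calls exec() on CMD, so the new program starts inside the new namespace.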
What you see in /proc/<pid>/ns/
Each process has a set of namespace handles, exposed as symbolic links under /proc/<pid>/ns/:
ls -l /proc/self/ns/
# net -> 'net:[4026531992]'
# mnt -> 'mnt:[4026531840]'
# pid -> 'pid:[4026531836]'
# ...
The number in brackets is the inode number of the namespace. If two processes
show the same number for net, they are in the same network namespace. Comparing
these numbers is the quickest way to check which namespace a process is in.
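The same comparison can be scripted. A minimal sketch, assuming the link targets have the `type:[inode]` shape shown in the listing; `parse_ns_link`, `namespace_id` and `same_namespace` are hypothetical helper names:

```python
import os
import re

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'net:[4026531992]' into ('net', 4026531992)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespace_id(pid, kind):
    """Return the inode number of namespace `kind` for process `pid`."""
    # Reading the links of another user's process may raise PermissionError.
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/{kind}"))[1]


def same_namespace(pid_a, pid_b, kind="net"):
    """True if both processes are in the same namespace of the given kind."""
    return namespace_id(pid_a, kind) == namespace_id(pid_b, kind)
```

For example, `same_namespace(os.getpid(), 1234)` would report whether the current process shares a network namespace with a process whose PID is 1234 (a made-up PID here).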
To enter the existing namespaces of another process, use nsenter:
sudo nsenter -t <pid> -n -p ip addr # run ip addr in the net+pid ns of process <pid>
Connection to Docker
When you run docker run image, Docker:
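- calls unshare() with the flags for the namespaces it needs: by default mnt, net, pid, uts and ipc, plus cgroup on cgroup v2 hosts; user only when userns-remap or rootless mode is enabled; not time
- creates a veth-pair, leaves one end on the host in a bridge, and puts the other end in the new net namespace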
- mounts an overlay filesystem as the process root
- places the process in cgroups for limits
- runs exec on the binary from the image
That is all. There is no VM and no hypervisor here. The isolation comes from namespaces plus cgroups.
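One way to check this on a running system is to compare the namespace links of a container's process with those of a host process and list the ones that differ. A hedged sketch; `read_ns_links` and `unshared_namespaces` are illustrative names, and reading the links of another user's process usually requires root:

```python
import os


def read_ns_links(pid):
    """Map each entry in /proc/<pid>/ns/ to its link target."""
    ns_dir = f"/proc/{pid}/ns"
    return {
        name: os.readlink(os.path.join(ns_dir, name))
        for name in os.listdir(ns_dir)
    }


def unshared_namespaces(reference, other):
    """Sorted names of the namespaces in which `other` differs from `reference`."""
    return sorted(
        name
        for name, target in other.items()
        if name in reference and reference[name] != target
    )
```

Assuming the container's host PID is known (for instance from `docker inspect --format '{{.State.Pid}}' <container>`), calling `unshared_namespaces(read_ns_links("self"), read_ns_links(container_pid))` from a host shell lists the namespaces the container does not share with the host.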