What kubelet is
kubelet is the only k8s component that works with
containers on the node. Everything you see in kubectl get pod
is the result of kubelet doing the following:
- Subscribes to the PodSpecs for its node through kube-apiserver
- Compares the desired state (from the API) with the actual state (what is really running)
- Makes changes through CRI/CSI/CNI
- Reports status back to the API
Do not confuse it with the container runtime (containerd). kubelet sits a level above: it commands the runtime through CRI gRPC.
kube-apiserver
▲ ▼ list/watch + status
kubelet (on every node)
│
├── CRI gRPC → containerd → [[runc-and-runsc|runc]] → container
├── CSI gRPC → csi-driver → mount/attach volume
└── CNI exec → calico/cilium → pod network
The main subsystems
Pod sync loop
The heart of kubelet is syncLoop in pkg/kubelet/kubelet.go.
It runs either on a timer (every N seconds, default 10s) or in response to events. Each iteration:
- Get the desired set of pods from the sources: the API server, --pod-manifest-path (static pods for the control plane), an HTTP URL.
- Compare with the running pods.
- For each pod that should be running but is not, call syncPod.
- For each one that is running but should not be, call killPod.
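The comparison step above can be sketched as a set difference. This is a simplified illustration, not kubelet's actual code: the function name `reconcile` and the use of plain pod-name strings are assumptions made for the example, and the real loop also re-syncs pods that are already running.

```python
def reconcile(desired, running):
    """Return (to_sync, to_kill) for two sets of pod identifiers.

    to_sync: pods that should be running but are not.
    to_kill: pods that are running but should not be.
    """
    to_sync = sorted(desired - running)
    to_kill = sorted(running - desired)
    return to_sync, to_kill


if __name__ == "__main__":
    desired = {"pod-a", "pod-b", "pod-c"}   # from api/file/http sources
    running = {"pod-b", "pod-d"}            # what the runtime reports
    to_sync, to_kill = reconcile(desired, running)
    print("syncPod for:", to_sync)
    print("killPod for:", to_kill)
```

Sorting the results only makes the output deterministic; the order carries no meaning here.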
There are three sources:
- api: the main one, a watch on kube-apiserver
- file: static pods from /etc/kubernetes/manifests/ (apiserver, etcd, controller-manager start themselves this way on the control plane)
- http: an external URL with a PodList
CRI, the Container Runtime Interface
This is the conversation with the container runtime over gRPC, through a socket,
unix:///var/run/containerd/containerd.sock (or the CRI-O equivalent). Two services:
- RuntimeService handles the pods/containers lifecycle (RunPodSandbox, CreateContainer, StartContainer, StopPodSandbox)
- ImageService handles pull/list/remove image
A pod in CRI is a PodSandbox (a container with the pause image, holding
the netns plus the ipc/uts namespace) plus N containers inside it,
all sharing the netns with pause.
The pause container exists so that:
- PID 1 in the pod namespace is always alive (to reap zombies)
- A restart of the application container does not kill the pod network
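To make the sandbox-then-containers order concrete, here is a small sketch with a fake runtime that only records calls. `FakeRuntime` and `start_pod` are hypothetical names for this example; the real CRI exchanges protobuf messages over gRPC and also involves image pulls, which are left out.

```python
class FakeRuntime:
    """Stand-in for a CRI runtime; records the order of calls it receives."""

    def __init__(self):
        self.calls = []

    def run_pod_sandbox(self, pod_name):
        self.calls.append(("RunPodSandbox", pod_name))
        return f"sandbox-{pod_name}"

    def create_container(self, sandbox_id, name):
        self.calls.append(("CreateContainer", name))
        return f"{sandbox_id}/{name}"

    def start_container(self, container_id):
        self.calls.append(("StartContainer", container_id))


def start_pod(runtime, pod_name, container_names):
    """Create the sandbox first, then create and start each container in it."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    container_ids = []
    for name in container_names:
        container_id = runtime.create_container(sandbox_id, name)
        runtime.start_container(container_id)
        container_ids.append(container_id)
    return sandbox_id, container_ids


if __name__ == "__main__":
    rt = FakeRuntime()
    start_pod(rt, "web", ["app", "sidecar"])
    for call in rt.calls:
        print(call)
```

The sandbox outlives individual containers, which is why restarting `app` in this model would not need another RunPodSandbox call.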
Container Manager
It manages the cgroup tree on the node:
/sys/fs/cgroup/
├── system.slice/                 # systemd
├── kubepods.slice/               # kubelet root
│   ├── kubepods-burstable.slice/ # QoS class
│   │   └── kubepods-burstable-pod<UID>.slice/
│   │       └── cri-containerd-<containerID>.scope/
│   ├── kubepods-besteffort.slice/
│   └── kubepods-pod<UID>.slice/  # Guaranteed
The hierarchy follows the pod's QoS class:
- Guaranteed: requests == limits for CPU and memory on all containers. Highest priority.
- Burstable: at least one container sets a request or limit, but the pod does not meet the Guaranteed rule (for example requests < limits)
- BestEffort: neither requests nor limits
When [[oom-killer|OOM]] hits the node, BestEffort goes first, then Burstable (by oom_score), and Guaranteed last.
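The three QoS rules can be written as one small function. This is a sketch under stated assumptions: containers are plain dicts with optional `requests` and `limits` maps, quantities are compared as literal strings (so "1" and "1000m" would count as different), and the name `qos_class` is made up for the example.

```python
def qos_class(containers):
    """Classify a pod from its containers' cpu/memory requests and limits."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request is defaulted to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    print(qos_class([{}]))
    print(qos_class([{"requests": {"cpu": "100m"}}]))
    print(qos_class([{"requests": {"cpu": "1", "memory": "1Gi"},
                      "limits": {"cpu": "1", "memory": "1Gi"}}]))
```

Note that a single container without limits is enough to drop the whole pod out of Guaranteed.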
Useful: --cpu-manager-policy=static gives Guaranteed pods with integer CPU requests exclusive physical CPUs, and --reserved-cpus keeps a set of CPUs aside for system and Kubernetes daemons.
Volume Manager
It reacts to pods with volumes by calling the CSI drivers (Attach → Mount → Unmount → Detach). For CSI this is all gRPC to the socket of the CSI Node plugin on the node.
Details:
- Mount/Unmount are idempotent, and kubelet retries on error
- Attach/Detach is done not by kubelet but by the external-attacher sidecar when attachRequired: true is set in the CSIDriver object (but the decision is made by the A/D Controller in kube-controller-manager)
- Mount flags (noatime, nodev) are set in the [[kubernetes-storage|StorageClass]] (mountOptions)
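Idempotent mounts plus retries can be illustrated with a toy model. Everything here is hypothetical: `ensure_mounted`, the `mounts` set standing in for the node's mount table, and the injected `mount_fn` are illustration-only names, and the real kubelet retries with backoff rather than in a tight loop.

```python
def ensure_mounted(mounts, target, mount_fn, attempts=3):
    """Mount `target` unless it is already mounted; retry on OSError.

    mounts:   set of currently mounted paths (toy mount table)
    mount_fn: callable that performs the mount and may raise OSError
    """
    for _ in range(attempts):
        if target in mounts:
            return True          # already mounted: nothing to do
        try:
            mount_fn(target)
            mounts.add(target)
        except OSError:
            continue             # transient failure: try again
    return target in mounts


if __name__ == "__main__":
    calls = []

    def flaky_mount(path):
        calls.append(path)
        if len(calls) < 3:
            raise OSError("transient failure")

    table = set()
    print(ensure_mounted(table, "/mnt/data", flaky_mount), len(calls))
    print(ensure_mounted(table, "/mnt/data", flaky_mount), len(calls))
```

The second call returns without touching `mount_fn`, which is the property that makes blind retries safe.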
Probes
Health checking of pods:
- livenessProbe: is the container alive? After failureThreshold consecutive failures, kubelet kills and restarts it
- readinessProbe: is the pod ready to accept traffic? If not, remove it from Endpoints (but do not kill it)
- startupProbe: holds off the liveness and readiness probes until it succeeds, for slowly starting applications (so you avoid a kill during boot)
Probe types: httpGet, tcpSocket, exec (run a command inside the container, exit code 0 means ok), grpc (GA since k8s 1.27).
The probe logic lives entirely in kubelet, with no network call to the API.
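The "fails N times" rule is a count of consecutive results, which a short sketch makes explicit. The class name `ProbeWorker` and its fields are assumptions for illustration; the defaults of 3 and 1 mirror the Kubernetes defaults for failureThreshold and successThreshold, and the initial healthy state models a liveness probe (a readiness probe starts out not ready).

```python
class ProbeWorker:
    """Tracks consecutive probe results and flips state at the thresholds."""

    def __init__(self, failure_threshold=3, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self.last_result = None
        self.streak = 0

    def record(self, ok):
        """Feed one probe result; return the resulting health state."""
        if ok == self.last_result:
            self.streak += 1
        else:
            self.last_result = ok
            self.streak = 1
        if ok and self.streak >= self.success_threshold:
            self.healthy = True
        elif not ok and self.streak >= self.failure_threshold:
            self.healthy = False
        return self.healthy


if __name__ == "__main__":
    worker = ProbeWorker()
    for result in (False, False, True, False, False, False):
        print(result, "->", worker.record(result))
```

A single success in the middle resets the failure streak, so only three failures in a row flip the state.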
Image GC and disk eviction
kubelet does not keep unused images forever, or the disk would fill up. The algorithm:
Periodically it checks the disk usage of /var/lib/containerd (or the CRI equivalent). If usage is above --image-gc-high-threshold (default 85%), it deletes unused images until usage drops to --image-gc-low-threshold (default 80%).
Ordering is LRU by time of last use. An image referenced by a running container is not deleted.
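The threshold-plus-LRU behaviour described above can be sketched as follows. This is an approximation under assumptions: the `Image` dataclass and `images_to_delete` are made-up names, sizes are abstract units, and deleting an image is assumed to free exactly its size.

```python
from dataclasses import dataclass


@dataclass
class Image:
    name: str
    size: int          # space the image occupies
    last_used: float   # timestamp of last use
    in_use: bool       # referenced by a running container


def images_to_delete(images, disk_used, disk_capacity, high_pct=85, low_pct=80):
    """Pick unused images, least recently used first, until usage <= low_pct."""
    if disk_used * 100 < high_pct * disk_capacity:
        return []                       # below the high threshold: no GC
    target = low_pct * disk_capacity / 100
    victims = []
    candidates = sorted((i for i in images if not i.in_use),
                        key=lambda i: i.last_used)
    for image in candidates:
        if disk_used <= target:
            break
        victims.append(image.name)
        disk_used -= image.size
    return victims


if __name__ == "__main__":
    images = [
        Image("old-unused", 5, 1.0, False),
        Image("old-in-use", 5, 2.0, True),
        Image("mid-unused", 10, 3.0, False),
        Image("new-unused", 10, 4.0, False),
    ]
    print(images_to_delete(images, disk_used=90, disk_capacity=100))
```

Images held by running containers are filtered out before sorting, so they are never candidates regardless of age.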
For logs: kubelet rotates stdout/stderr through CRI (--container-log-max-size, default 10Mi; --container-log-max-files, default 5).
Node-pressure eviction
When the node's resources run low, kubelet proactively kills pods. It respects priority:
| Signal | Default hard threshold | What gets evicted |
|---|---|---|
| memory.available | < 100Mi | BestEffort → Burstable → Guaranteed |
| nodefs.available | < 10% | same, by QoS |
| nodefs.inodesFree | < 5% | same |
| imagefs.available | < 15% | image GC first, then eviction |
| pid.available | none by default (set explicitly, e.g. < 10%) | same |
With soft eviction (--eviction-soft) and --eviction-soft-grace-period,
the pod gets time for a graceful shutdown. Hard eviction is an immediate SIGKILL.
This is separate from the kernel's [[oom-killer|OOM killer]]: the kernel kills within seconds, while kubelet proactively evicts over minutes.
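How a threshold such as "100Mi" or "10%" turns into a pressure decision can be sketched like this. The helper names `threshold_bytes` and `signals_under_pressure` are hypothetical, only the Ki/Mi/Gi suffixes are handled, and the real eviction manager also applies grace periods and pod ranking that are not modelled.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def threshold_bytes(threshold, capacity):
    """Convert '10%' or '100Mi' into an absolute number of bytes."""
    if threshold.endswith("%"):
        return capacity * float(threshold[:-1]) / 100
    for suffix, factor in UNITS.items():
        if threshold.endswith(suffix):
            return float(threshold[:-len(suffix)]) * factor
    return float(threshold)


def signals_under_pressure(available, capacity, thresholds):
    """Return the signals whose available amount is below the threshold."""
    return sorted(
        signal
        for signal, threshold in thresholds.items()
        if available[signal] < threshold_bytes(threshold, capacity[signal])
    )


if __name__ == "__main__":
    mi, gi = UNITS["Mi"], UNITS["Gi"]
    thresholds = {"memory.available": "100Mi", "nodefs.available": "10%"}
    available = {"memory.available": 50 * mi, "nodefs.available": 20 * gi}
    capacity = {"memory.available": 8 * gi, "nodefs.available": 100 * gi}
    print(signals_under_pressure(available, capacity, thresholds))
```

Percentage thresholds are relative to the capacity of the resource, which is why the capacity map is needed alongside the available amounts.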
kubelet config
A KubeletConfiguration file (instead of CLI flags; kubeadm generates it from a ConfigMap):
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd # must match containerd
containerRuntimeEndpoint: unix:///var/run/containerd/containerd.sock
clusterDNS: [10.96.0.10]
clusterDomain: cluster.local
podCIDR: 10.244.1.0/24 # from the node spec
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
maxPods: 110
serializeImagePulls: false # pull in parallel
cgroupDriver is a frequent pain point: systemd (the modern default) versus cgroupfs. The value must match the containerd config, otherwise kubelet sees pods as "not its own" and chaos ensues.
Static pods
These are pods that kubelet runs independently of the API server,
by reading YAML from /etc/kubernetes/manifests/. They are used for the
control plane (kube-apiserver, etcd, controller-manager,
scheduler, each of them a static pod on the control plane node).
In the API they appear as mirror pods, but kubelet
does not listen to their edits: the source of truth is the file.
Useful: for a self-bootstrapping control plane (like kubeadm), and for standalone agents.
CSI / CNI / Device Plugin
kubelet calls out to three externals:
- CSI over gRPC (mounted at the well-known path /var/lib/kubelet/plugins/<driver>/csi.sock)
- CNI by exec of a binary (/opt/cni/bin/<plugin>)
- Device Plugin over gRPC (for GPU, FPGA, RDMA; a DaemonSet registers with kubelet, and kubelet passes it through to the pod)
When things go wrong
- Node not ready right after start: kubelet could not reach the CRI socket, or CNI is not configured (/etc/cni/net.d/ is empty). Check journalctl -u kubelet.
- failed to get system container stats: failed to get cgroup stats: a cgroup driver mismatch (systemd vs cgroupfs) between kubelet and containerd.
- Pod Evicted with no obvious reason: run kubectl describe node and check Conditions: (MemoryPressure/DiskPressure/PIDPressure). Most often it is a full disk from the image cache. Run crictl rmi --prune.
- exec format error in the pause container: a multi-arch image does not suit the node. Check kubectl describe pod for the image arch.
- Pod stuck in Terminating: a finalizer is holding it, or CRI is not answering StopPodSandbox. Run crictl ps to see whether the container exists, then crictl logs <id>.
- kubelet OOMs on itself: too many pods and large PodSpecs in RAM. Tune --max-pods, add --kube-api-qps/burst.
- CSI mount hangs: the csi-node-plugin pod has crashed. Check kubectl get pods -n kube-system | grep csi.
- kubelet log shows syncPod errored: frequent and generic. It helps to use --v=4 for verbose output, but the logs balloon.
Useful diagnostic artifacts
- journalctl -u kubelet -f, the main logs
- crictl ps (instead of docker ps for CRI runtimes)
- crictl logs <id>, container logs
- crictl images, what is pulled on the node
- cat /var/lib/kubelet/config.yaml, the current config
- curl -k https://localhost:10250/metrics, Prometheus metrics
- curl -k https://localhost:10250/healthz