kubelet: the Kubernetes node agent architecture

What kubelet is

kubelet is the only k8s component that works with containers on the node. Everything you see in kubectl get pod is the result of kubelet doing the following:

Subscribes to the PodSpecs for its node through kube-apiserver
Compares the desired state (from the API) with the actual state (what is really running)
Makes changes through CRI/CSI/CNI
Reports status back to the API

Do not confuse it with the container runtime (containerd). kubelet sits a level above: it commands the runtime through CRI gRPC.

kube-apiserver
      ▲ ▼ list/watch + status
kubelet (on every node)
      │
      ├── CRI gRPC → containerd → [[runc-and-runsc|runc]] → container
      ├── CSI gRPC → csi-driver → mount/attach volume
      └── CNI exec → calico/cilium → pod network

The main subsystems

Pod sync loop

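The heart of kubelet is syncLoop in pkg/kubelet/kubelet.go. It is mostly event-driven (pod changes from the sources, container lifecycle events, probe results), with a periodic full resync as a safety net (syncFrequency, default 1m). On each pass it does the following: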

Get the desired set of pods from the sources: the API server, --pod-manifest-path (static pods for the control plane), an HTTP URL.
Compare with the running pods.
For each pod that should be running but is not, call syncPod.
For each one that is running but should not be, call killPod.
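The four steps above amount to a set difference between desired and running pods. Here is a minimal sketch of one pass, assuming pods are identified by name only; `FakeNode` and `sync_once` are hypothetical names for illustration, not kubelet's actual types:

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Stand-in for the node's runtime state; records what a pass would do."""
    running: set[str] = field(default_factory=set)
    log: list[str] = field(default_factory=list)

    def sync_pod(self, name: str) -> None:
        self.log.append(f"syncPod {name}")
        self.running.add(name)

    def kill_pod(self, name: str) -> None:
        self.log.append(f"killPod {name}")
        self.running.discard(name)


def sync_once(desired: set[str], node: FakeNode) -> None:
    """One reconciliation pass: start what is missing, stop what is extra."""
    for name in sorted(desired - node.running):
        node.sync_pod(name)
    for name in sorted(node.running - desired):
        node.kill_pod(name)


node = FakeNode(running={"old-job", "web"})
sync_once({"web", "etcd"}, node)
print(node.log)  # ['syncPod etcd', 'killPod old-job']
```

Running the same pass again with the same desired set produces no new log entries, which is the point of a reconciliation loop.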

There are three sources:

api: the main one, a watch on kube-apiserver
file: static pods from /etc/kubernetes/manifests/ (apiserver, etcd, controller-manager start themselves this way on the control plane)
http: an external URL with a PodList

CRI, the Container Runtime Interface

kubelet talks to the container runtime over gRPC on a Unix socket, unix:///var/run/containerd/containerd.sock (or the CRI-O equivalent). CRI defines two services:

RuntimeService handles the pod and container lifecycle (RunPodSandbox, CreateContainer, StartContainer, StopPodSandbox)
ImageService handles image pull, list, and remove

A pod in CRI is a PodSandbox (a container with the pause image, holding the netns plus the ipc/uts namespace) plus N containers inside it, all sharing the netns with pause.

The pause container exists so that:

PID 1 in the pod's PID namespace is always alive and reaps zombies (this applies when the containers share a process namespace, shareProcessNamespace: true)
A restart of the application container does not kill the pod network
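The order of CRI calls when a pod starts can be illustrated with a recording stub. This is a sketch under assumptions: `RecordingRuntime` and `start_pod` are hypothetical, the method names mirror the CRI RPCs (RunPodSandbox, PullImage, CreateContainer, StartContainer), and a real kubelet pulls an image only when the pull policy requires it.

```python
class RecordingRuntime:
    """Hypothetical in-memory stand-in for a CRI runtime; records call order."""

    def __init__(self):
        self.calls = []
        self._next_id = 0

    def _new_id(self, prefix):
        self._next_id += 1
        return f"{prefix}-{self._next_id}"

    # RuntimeService
    def run_pod_sandbox(self, pod_name):
        self.calls.append("RunPodSandbox")
        return self._new_id("sandbox")

    def create_container(self, sandbox_id, name, image):
        self.calls.append(f"CreateContainer({name})")
        return self._new_id("ctr")

    def start_container(self, container_id):
        self.calls.append(f"StartContainer({container_id})")

    # ImageService
    def pull_image(self, image):
        self.calls.append(f"PullImage({image})")


def start_pod(runtime, pod_name, containers):
    """Create the sandbox (pause + namespaces) first, then each container in it."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    for name, image in containers:
        runtime.pull_image(image)
        container_id = runtime.create_container(sandbox_id, name, image)
        runtime.start_container(container_id)
    return sandbox_id


runtime = RecordingRuntime()
start_pod(runtime, "web", [("app", "nginx:1.27")])
print(runtime.calls)
```

The sandbox comes first because the application containers join its network namespace; restarting `app` later repeats only the CreateContainer and StartContainer steps.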

Container Manager

It manages the cgroup tree on the node:

/sys/fs/cgroup/
├── system.slice/                    # systemd
├── kubepods.slice/                  # kubelet root
│   ├── kubepods-burstable.slice/    # QoS class
│   │   └── kubepods-burstable-pod<UID>.slice/
│   │       └── cri-containerd-<containerID>.scope/
│   ├── kubepods-besteffort.slice/
│   └── kubepods-pod<UID>.slice/     # Guaranteed
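To find a pod's cgroup by hand, the path can be derived from its QoS class and UID. This is a sketch assuming the systemd cgroup driver and the layout shown above; `pod_cgroup_path` is a hypothetical helper. With the systemd driver, dashes in the UID are replaced by underscores, because systemd treats `-` in a slice name as a hierarchy separator.

```python
def pod_cgroup_path(qos: str, pod_uid: str, root: str = "/sys/fs/cgroup") -> str:
    """Build the pod-level cgroup path for the systemd cgroup driver."""
    uid = pod_uid.replace("-", "_")
    if qos == "Guaranteed":
        # Guaranteed pods sit directly under the kubelet root slice.
        parts = ["kubepods.slice", f"kubepods-pod{uid}.slice"]
    elif qos in ("Burstable", "BestEffort"):
        tier = qos.lower()
        parts = [
            "kubepods.slice",
            f"kubepods-{tier}.slice",
            f"kubepods-{tier}-pod{uid}.slice",
        ]
    else:
        raise ValueError(f"unknown QoS class: {qos}")
    return "/".join([root, *parts])


print(pod_cgroup_path("Burstable", "1234-ab"))
# /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod1234_ab.slice
```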

The hierarchy follows the pod's QoS class:

Guaranteed: every container sets CPU and memory requests equal to its limits. Highest priority.
Burstable: at least one container sets a request or a limit, but the pod does not meet the Guaranteed criteria
BestEffort: no container sets any requests or limits
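The classification rules can be written down as a small function. This is a sketch: `qos_class` is a hypothetical helper, containers are plain dicts, and quantities are compared as literal strings (so `"500m"` and `"0.5"` would count as different, unlike in the real API).

```python
def qos_class(containers):
    """containers: list of dicts with optional 'requests' and 'limits' maps."""
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # A missing request defaults to the limit for that resource.
        requests = {**limits, **c.get("requests", {})}
        for res in ("cpu", "memory"):
            if res in requests or res in limits:
                any_set = True
            if res not in limits or requests[res] != limits[res]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{"limits": {"cpu": "1", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"memory": "1Gi"}}]))            # Burstable
print(qos_class([{}]))                                         # BestEffort
```

A single container without limits is enough to drop the whole pod out of Guaranteed, since the class is a property of the pod, not of the container.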

When [[oom-killer|OOM]] hits the node, BestEffort goes first, then Burstable (ordered by oom_score_adj, which is lower for containers that request a larger share of node memory), and Guaranteed last.
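The ordering comes from the oom_score_adj value kubelet assigns to each container. Here is a sketch of the documented rule (Guaranteed gets -997, BestEffort gets 1000, Burstable is scaled by the memory request and clamped to 2..999); `oom_score_adj` is a hypothetical helper name, and node-critical pods are left out.

```python
def oom_score_adj(qos: str, memory_request_bytes: int, node_memory_bytes: int) -> int:
    """Approximate the oom_score_adj kubelet sets for a container."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: the larger the requested share of node memory, the lower the score.
    adj = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    return min(max(2, adj), 999)


GI = 2**30
print(oom_score_adj("Burstable", 1 * GI, 4 * GI))  # 750
print(oom_score_adj("Burstable", 0, 4 * GI))       # 999
```

A higher value makes the kernel prefer that process as a victim, which is why BestEffort containers are killed first.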

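Useful: --cpu-manager-policy=static pins exclusive physical CPUs to containers in Guaranteed pods that request whole CPUs, and --reserved-cpus keeps the listed CPUs for system and Kubernetes daemons.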

Volume Manager

It reacts to pods with volumes by driving the volume lifecycle (Attach → Mount → Unmount → Detach). For CSI volumes, the node-side calls are all gRPC to the socket of the CSI Node plugin on the node.

Details:

Mount/Unmount are idempotent, and kubelet retries on error
Attach/Detach is done not by kubelet but by the external-attacher sidecar when attachRequired: true is set on the CSIDriver object (the decision itself is made by the A/D Controller in kube-controller-manager)
Mount options (noatime, nodev) are set in the [[kubernetes-storage|StorageClass]] (mountOptions)
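Idempotent mounts are what make blind retries safe. This is a sketch under assumptions: `FakeNodePlugin` is a hypothetical stand-in for a CSI Node plugin (its method is named after the NodePublishVolume RPC), and the doubling delay only illustrates a backoff and does not reproduce kubelet's real timings.

```python
import time


class FakeNodePlugin:
    """Hypothetical CSI Node plugin that fails a few times, then succeeds."""

    def __init__(self, failures_before_success=0):
        self.mounted = set()
        self.publish_calls = 0
        self._failures = failures_before_success

    def node_publish_volume(self, volume_id, target_path):
        self.publish_calls += 1
        if (volume_id, target_path) in self.mounted:
            return  # already mounted: repeating the call is a no-op
        if self._failures > 0:
            self._failures -= 1
            raise RuntimeError("transient mount error")
        self.mounted.add((volume_id, target_path))


def mount_with_retry(plugin, volume_id, target_path,
                     attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry the mount with a doubling delay; return the attempt that succeeded."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            plugin.node_publish_volume(volume_id, target_path)
            return attempt
        except RuntimeError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2


plugin = FakeNodePlugin(failures_before_success=2)
no_sleep = lambda seconds: None  # skip real waiting in the demo
print(mount_with_retry(plugin, "vol-1", "/mnt/data", sleep=no_sleep))  # 3
print(mount_with_retry(plugin, "vol-1", "/mnt/data", sleep=no_sleep))  # 1
```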

Probes

Health checking of containers:

livenessProbe: is the container alive? After failureThreshold consecutive failures, kubelet kills the container and restarts it according to restartPolicy
readinessProbe: is the container ready to accept traffic? If not, the pod is removed from Endpoints (but not killed)
startupProbe: holds off liveness and readiness checks until it succeeds, for slowly starting applications (so you avoid a kill during boot)

Probe types: httpGet, tcpSocket, exec (run a command inside the container, exit code 0 means ok), grpc (GA since k8s 1.27).

The probe logic lives entirely in kubelet, with no network call to the API.
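A single failed probe does not change anything: the result flips only after enough consecutive identical outcomes. Here is a sketch of that threshold logic, assuming the default failureThreshold of 3 and successThreshold of 1; `ProbeWorker` is a hypothetical class, not kubelet's own.

```python
class ProbeWorker:
    """Tracks consecutive probe results and flips state at the thresholds."""

    def __init__(self, failure_threshold=3, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = True
        self._last = None
        self._streak = 0

    def observe(self, ok: bool) -> bool:
        """Record one probe outcome and return the current reported state."""
        if ok == self._last:
            self._streak += 1
        else:
            self._last = ok
            self._streak = 1
        threshold = self.success_threshold if ok else self.failure_threshold
        if self._streak >= threshold:
            self.healthy = ok
        return self.healthy


worker = ProbeWorker()
outcomes = [True, False, False, True, False, False, False]
print([worker.observe(ok) for ok in outcomes])
# [True, True, True, True, True, True, False]
```

The success in the middle resets the failure streak, so only the final run of three failures marks the container unhealthy.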

Image GC and disk eviction

kubelet does not keep unused images forever, or the disk would fill up. The algorithm:

Periodically it checks the usage of the image filesystem, /var/lib/containerd (or the CRI equivalent). If usage is above --image-gc-high-threshold (default 85%), it deletes unused images until usage is down to --image-gc-low-threshold (default 80%).

Ordering is LRU by time of last use. An image referenced by a running container is not deleted.
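The threshold arithmetic and LRU ordering can be sketched as follows. `Image` and `images_to_delete` are hypothetical names, sizes are in arbitrary units, and the sketch triggers at or above the high threshold.

```python
from dataclasses import dataclass


@dataclass
class Image:
    name: str
    size: int
    last_used: float   # timestamp of last use
    in_use: bool = False


def images_to_delete(images, capacity, used, high_pct=85, low_pct=80):
    """Pick unused images, least recently used first, until usage reaches low_pct."""
    if used * 100 < capacity * high_pct:
        return []  # below the high threshold: nothing to do
    to_free = used - capacity * low_pct // 100
    victims, freed = [], 0
    unused = (img for img in images if not img.in_use)
    for img in sorted(unused, key=lambda i: i.last_used):
        if freed >= to_free:
            break
        victims.append(img.name)
        freed += img.size
    return victims


images = [
    Image("a", size=5, last_used=1),
    Image("b", size=10, last_used=2, in_use=True),
    Image("c", size=8, last_used=3),
    Image("d", size=20, last_used=4),
]
print(images_to_delete(images, capacity=100, used=90))  # ['a', 'c']
```

Image `b` is older than `c` but is skipped because a running container references it.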

For logs: kubelet rotates container stdout/stderr through CRI (--container-log-max-size, default 10Mi; --container-log-max-files, default 5).

Node-pressure eviction

When the node's resources run low, kubelet proactively kills pods. It respects priority:

| Signal | Default hard threshold | What gets evicted |
|---|---|---|
| `memory.available` | < 100Mi | BestEffort → Burstable → Guaranteed |
| `nodefs.available` | < 10% | same, by QoS |
| `nodefs.inodesFree` | < 5% | same |
| `imagefs.available` | < 15% | image GC first, then eviction |
| `pid.available` | none by default (set it explicitly, for example < 10%) | same |

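With soft eviction (--eviction-soft plus --eviction-soft-grace-period), the threshold has to stay exceeded for the whole grace period before kubelet acts, and the pod then gets time for a graceful shutdown (capped by --eviction-max-pod-grace-period). Hard eviction has no grace period: the pod is killed immediately.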

This is separate from the kernel's [[oom-killer|OOM killer]]: the kernel kills within seconds, while kubelet proactively evicts over minutes.
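Checking signals against thresholds comes down to a comparison per signal. This is a sketch assuming thresholds are given either as a percentage of capacity or as a binary quantity (Ki, Mi, Gi); `threshold_bytes` and `signals_under_pressure` are hypothetical helpers and handle only those forms.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def threshold_bytes(spec: str, capacity: int) -> int:
    """Resolve '10%' or '100Mi' into an absolute amount."""
    if spec.endswith("%"):
        return int(capacity * float(spec[:-1]) / 100)
    for suffix, multiplier in UNITS.items():
        if spec.endswith(suffix):
            return int(spec[: -len(suffix)]) * multiplier
    return int(spec)


def signals_under_pressure(thresholds, observed):
    """observed maps signal -> (available, capacity); returns signals below threshold."""
    return sorted(
        signal
        for signal, spec in thresholds.items()
        if signal in observed
        and observed[signal][0] < threshold_bytes(spec, observed[signal][1])
    )


MI, GI = 2**20, 2**30
thresholds = {"memory.available": "100Mi", "nodefs.available": "10%"}
observed = {
    "memory.available": (50 * MI, 8 * GI),     # 50Mi free of 8Gi
    "nodefs.available": (20 * GI, 100 * GI),   # 20% free
}
print(signals_under_pressure(thresholds, observed))  # ['memory.available']
```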

kubelet config

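A file-based config, passed to kubelet with --config (instead of CLI flags; kubeadm keeps the cluster-wide copy in a ConfigMap and writes it to the node):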

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd                  # must match containerd
containerRuntimeEndpoint: unix:///var/run/containerd/containerd.sock
clusterDNS: [10.96.0.10]
clusterDomain: cluster.local
podCIDR: 10.244.1.0/24                  # from the node spec
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
maxPods: 110
serializeImagePulls: false              # pull in parallel

cgroupDriver is a frequent pain point: systemd (the kubeadm default) vs cgroupfs. It must match the containerd config (SystemdCgroup for the runc runtime), otherwise kubelet and the runtime manage two different cgroup trees, kubelet sees pods as "not its own", and chaos ensues.
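The systemReserved, kubeReserved, and evictionHard values in the config above also decide how much of the node the scheduler may hand out: allocatable is capacity minus both reservations minus the hard eviction threshold. This sketch covers the memory arithmetic, assuming a hypothetical 16Gi node; `allocatable_memory` is an illustrative helper.

```python
MI = 2**20
GI = 2**30


def allocatable_memory(capacity, kube_reserved, system_reserved, eviction_hard):
    """Node allocatable = capacity - kubeReserved - systemReserved - evictionHard."""
    return max(0, capacity - kube_reserved - system_reserved - eviction_hard)


# Values from the config above on an assumed 16Gi node.
result = allocatable_memory(
    capacity=16 * GI,
    kube_reserved=1 * GI,
    system_reserved=1 * GI,
    eviction_hard=200 * MI,
)
print(result // MI, "Mi")  # 14136 Mi
```

So on such a node roughly 2.2Gi is never offered to pods, which is worth remembering when sizing requests against the raw machine memory.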

Static pods

These are pods that kubelet runs independently of the API server, by reading YAML from /etc/kubernetes/manifests/. They are used for the control plane (kube-apiserver, etcd, controller-manager, scheduler, each of them a static pod on the control plane node).

In the API they appear as mirror pods, but kubelet ignores edits made to them through the API: the source of truth is the file.

Useful: for a self-bootstrapping control plane (like kubeadm), and for standalone agents.

CSI / CNI / Device Plugin

kubelet relies on three external plugin interfaces:

CSI over gRPC (socket at the well-known path /var/lib/kubelet/plugins/<driver>/csi.sock)
CNI by exec of a binary (/opt/cni/bin/<plugin>); with a CRI runtime, the runtime performs this exec while handling RunPodSandbox
Device Plugin over gRPC (for GPU, FPGA, RDMA; the plugin, usually deployed as a DaemonSet, registers with kubelet, and kubelet exposes the allocated devices to the container)
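Unlike CSI and Device Plugins, CNI is not a long-running service: a plugin is a binary that receives its command through environment variables and the network config as JSON on stdin. This sketch builds (but does not run) such an invocation, using variable names from the CNI specification; the network name, container ID, and netns path are made-up illustration values.

```python
import json


def cni_add_invocation(plugin, container_id, netns_path, conf,
                       bin_dir="/opt/cni/bin", ifname="eth0"):
    """Return (argv, env, stdin) for a CNI ADD call without executing it."""
    argv = [f"{bin_dir}/{plugin}"]
    env = {
        "CNI_COMMAND": "ADD",
        "CNI_CONTAINERID": container_id,
        "CNI_NETNS": netns_path,
        "CNI_IFNAME": ifname,
        "CNI_PATH": bin_dir,
    }
    stdin = json.dumps(conf).encode()
    return argv, env, stdin


conf = {"cniVersion": "1.0.0", "name": "podnet", "type": "bridge"}
argv, env, stdin = cni_add_invocation(
    "bridge", "sandbox-123", "/var/run/netns/example", conf
)
print(argv, env["CNI_COMMAND"], stdin.decode())
```

The three return values map directly onto the `args`, `env`, and `input` parameters of `subprocess.run`, should you want to execute a plugin by hand on a test node.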

When things go wrong

Node not ready right after start: kubelet could not reach the CRI socket, or CNI is not configured (/etc/cni/net.d/ is empty). Check journalctl -u kubelet.
failed to get system container stats: failed to get cgroup stats: a cgroup driver mismatch (systemd vs cgroupfs) between kubelet and containerd.
Pod Evicted with no obvious reason: run kubectl describe node and check Conditions: (MemoryPressure/DiskPressure/PIDPressure). Most often it is a full disk from the image cache. Run crictl rmi --prune.
exec format error in the pause container: the image architecture does not match the node (for example an amd64 image on an arm64 node). Compare the node's kubernetes.io/arch label with the platforms the image is published for.
Pod stuck in Terminating: a finalizer is holding it, or CRI is not answering StopPodSandbox. Run crictl ps to see whether the container exists, then crictl logs <id>.
kubelet itself gets OOM-killed: too many pods and large PodSpecs held in RAM. Lower --max-pods and tune --kube-api-qps / --kube-api-burst.
CSI mount hangs: the csi-node-plugin pod has crashed. Check kubectl get pods -n kube-system | grep csi.
kubelet log shows syncPod errored: frequent and generic. It helps to use --v=4 for verbose output, but the logs balloon.

Useful diagnostic artifacts

journalctl -u kubelet -f, the main logs
crictl ps (instead of docker ps for CRI runtimes)
crictl logs <id>, container logs
crictl images, what is pulled on the node
cat /var/lib/kubelet/config.yaml, the current config
curl -k https://localhost:10250/metrics, Prometheus metrics (needs credentials unless anonymous access is enabled)
curl -k https://localhost:10250/healthz, the health endpoint
