linuxlab.io
Copyright © 2026 LinuxLab. All rights reserved.
kb/containers ── Containers (bonus) ── advanced

kubelet: the Kubernetes node agent architecture

kubelet is a daemon on every node. It receives the PodSpec through the API, starts containers through CRI, mounts volumes through CSI, and watches health. Under pressure it does eviction. Image GC and the cgroup tree are also its job.

aka: kubelet, k8s-kubelet, node-agent, cri, container-manager

What kubelet is

kubelet is the only k8s component that works with containers on the node. Everything you see in kubectl get pod is the result of kubelet doing the following:

  1. Subscribes to the PodSpecs for its node through kube-apiserver
  2. Compares the desired state (from the API) with the actual state (what is really running)
  3. Makes changes through CRI/CSI/CNI
  4. Reports status back to the API

Do not confuse it with the container runtime (containerd). kubelet sits a level above: it commands the runtime through CRI gRPC.

kube-apiserver
      ▲ ▼ list/watch + status
kubelet (on every node)
      │
      ├── CRI gRPC → containerd → runc → container
      ├── CSI gRPC → csi-driver → mount/attach volume
      └── CNI exec → calico/cilium → pod network

The main subsystems

Pod sync loop

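The heart of kubelet is syncLoop in pkg/kubelet/kubelet.go. The loop is event-driven (pod updates from the sources, container lifecycle events, probe results), with a periodic resync as a safety net (syncFrequency, default 1m). On each pass it does the following: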

  1. Get the desired set of pods from the sources: the API server, --pod-manifest-path (static pods for the control plane), an HTTP URL.
  2. Compare with the running pods.
  3. For each pod that should be running but is not, call syncPod.
  4. For each one that is running but should not be, call killPod.
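
The compare step above can be sketched as a set difference keyed by pod UID. This is a simplified illustration, not kubelet's actual code: the Pod type and the plan_sync helper are hypothetical names, and the real syncPod is also invoked for pods that are already running so their containers can be reconciled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pod:
    uid: str
    name: str


def plan_sync(desired, running):
    """Return (to_sync, to_kill) by comparing desired and running pods by UID."""
    desired_by_uid = {p.uid: p for p in desired}
    running_by_uid = {p.uid: p for p in running}
    # Should be running but is not: candidates for syncPod.
    to_sync = [p for uid, p in desired_by_uid.items() if uid not in running_by_uid]
    # Running but no longer desired: candidates for killPod.
    to_kill = [p for uid, p in running_by_uid.items() if uid not in desired_by_uid]
    return to_sync, to_kill


desired = [Pod("a1", "web"), Pod("b2", "db")]
running = [Pod("b2", "db"), Pod("c3", "old-job")]
to_sync, to_kill = plan_sync(desired, running)
print([p.name for p in to_sync])  # ['web']
print([p.name for p in to_kill])  # ['old-job']
```

Keying on UID rather than name matters: a pod that was deleted and recreated under the same name is a different pod and must be torn down and started again.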

There are three sources:

  • api: the main one, a watch on kube-apiserver
  • file: static pods from /etc/kubernetes/manifests/ (apiserver, etcd, controller-manager start themselves this way on the control plane)
  • http: an external URL with a PodList

CRI, the Container Runtime Interface

kubelet talks to the container runtime over gRPC on a Unix socket: unix:///var/run/containerd/containerd.sock for containerd (CRI-O has its own equivalent). The API consists of two services:

  • RuntimeService handles the pods/containers lifecycle (RunPodSandbox, CreateContainer, StartContainer, StopPodSandbox)
  • ImageService handles pull/list/remove image

A pod in CRI is a PodSandbox (a container with the pause image, holding the netns plus the ipc/uts namespace) plus N containers inside it, all sharing the netns with pause.

The pause container exists so that:

  • There is always a live process holding the pod's namespaces (and, when the PID namespace is shared, acting as PID 1 to reap zombies)
  • A restart of the application container does not kill the pod network
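
The order of RuntimeService calls for starting a pod can be illustrated with a recording stub. FakeRuntime and start_pod are hypothetical names invented for this sketch; only the recorded method names (RunPodSandbox, CreateContainer, StartContainer) come from CRI, and image pulls and error handling are left out.

```python
class FakeRuntime:
    """Hypothetical in-memory stand-in for a CRI runtime; it only records calls."""

    def __init__(self):
        self.calls = []

    def run_pod_sandbox(self, pod_name):
        self.calls.append(("RunPodSandbox", pod_name))
        return f"sandbox-{pod_name}"

    def create_container(self, sandbox_id, name):
        self.calls.append(("CreateContainer", name))
        return f"{sandbox_id}/{name}"

    def start_container(self, container_id):
        self.calls.append(("StartContainer", container_id))


def start_pod(runtime, pod_name, container_names):
    """Create the sandbox once, then create and start each container inside it."""
    sandbox_id = runtime.run_pod_sandbox(pod_name)
    container_ids = []
    for name in container_names:
        container_id = runtime.create_container(sandbox_id, name)
        runtime.start_container(container_id)
        container_ids.append(container_id)
    return sandbox_id, container_ids


rt = FakeRuntime()
sandbox, containers = start_pod(rt, "web", ["nginx", "sidecar"])
print([call[0] for call in rt.calls])
```

The sandbox is created exactly once per pod, which is why restarting an application container later does not touch the network namespace.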

Container Manager

It manages the cgroup tree on the node:

/sys/fs/cgroup/
├── system.slice/                    # systemd
├── kubepods.slice/                  # kubelet root
│   ├── kubepods-burstable.slice/    # QoS class
│   │   └── kubepods-burstable-pod<UID>.slice/
│   │       └── cri-containerd-<containerID>.scope/
│   ├── kubepods-besteffort.slice/
│   └── kubepods-pod<UID>.slice/     # Guaranteed

The hierarchy follows the pod's QoS class:

  • Guaranteed: every container sets CPU and memory limits, and requests == limits. Highest priority.
  • Burstable: at least one container sets a request or a limit, but the pod does not meet the Guaranteed criteria
  • BestEffort: no container sets any requests or limits
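
The classification rules can be written down as a small function. This is a sketch under stated assumptions: qos_class is a hypothetical helper, containers are plain dicts, and quantities are compared as strings (so "500m" and "0.5" would wrongly count as different). A missing request is treated as equal to the limit, which is how Kubernetes defaults it.

```python
def qos_class(containers):
    """Classify a pod from its containers' 'requests' and 'limits' dicts."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


full = {"cpu": "500m", "memory": "1Gi"}
print(qos_class([{"requests": full, "limits": full}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "100m"}}]))       # Burstable
print(qos_class([{}]))                                  # BestEffort
```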

When OOM hits the node, BestEffort goes first, then Burstable (by oom_score), and Guaranteed last. kubelet arranges this by setting oom_score_adj on each container according to its QoS class.

Useful: --cpu-manager-policy=static gives Guaranteed pods with integer CPU requests exclusive physical CPUs, and --reserved-cpus keeps a set of CPUs back for the system and kubelet.

Volume Manager

It watches for pods that use volumes and drives the volume lifecycle (Attach → Mount → Unmount → Detach). For CSI, the node-side steps are all gRPC calls to the socket of the CSI Node plugin on the node.

Details:

  • Mount/Unmount are idempotent, and kubelet retries on error
  • Attach/Detach is performed not by kubelet but by the external-attacher sidecar, and only when the CSI driver declares attachRequired: true. The decision to attach or detach is made by the A/D Controller in kube-controller-manager
  • Mount flags (noatime, nodev) are set in mountOptions of the StorageClass

Probes

Health checking of pods:

  • livenessProbe: is the container alive? After failureThreshold consecutive failures (default 3), kubelet kills the container and restarts it according to restartPolicy
  • readinessProbe: is the container ready to accept traffic? If not, the pod is removed from Endpoints (but not killed)
  • startupProbe: holds back liveness and readiness checks until it succeeds, for slowly starting applications (so you avoid a kill during boot)

Probe types: httpGet, tcpSocket, exec (run a command inside the container, exit code 0 means ok), grpc (GA since Kubernetes 1.27).

The probe logic lives entirely in kubelet, with no network call to the API.
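
The "fails N times" rule is a counter of consecutive results. A minimal sketch follows, assuming the Kubernetes defaults failureThreshold=3 and successThreshold=1; the ProbeState class is a hypothetical name and the real prober also handles timeouts, initial delays and periods.

```python
class ProbeState:
    """Tracks consecutive probe results against the configured thresholds."""

    def __init__(self, failure_threshold=3, success_threshold=1):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.failures = 0
        self.successes = 0
        self.healthy = True

    def record(self, ok):
        """Feed one probe result and return the resulting health verdict."""
        if ok:
            self.successes += 1
            self.failures = 0  # a success resets the failure streak
            if self.successes >= self.success_threshold:
                self.healthy = True
        else:
            self.failures += 1
            self.successes = 0  # a failure resets the success streak
            if self.failures >= self.failure_threshold:
                self.healthy = False
        return self.healthy


state = ProbeState()
results = [True, False, False, True, False, False, False]
verdicts = [state.record(ok) for ok in results]
print(verdicts)  # [True, True, True, True, True, True, False]
```

Two failures followed by a success do not trip the probe; only three failures in a row do, which is what keeps a single slow response from restarting the container.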

Image GC and disk eviction

kubelet does not keep unused images forever, or the disk would fill up. The algorithm:

Periodically it checks the disk usage of /var/lib/containerd (or the CRI equivalent). If usage is above --image-gc-high-threshold (default 85%), it deletes unused images until usage drops to --image-gc-low-threshold (default 80%).

Ordering is LRU by time of last use. An image referenced by a running container is not deleted.
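
The threshold-plus-LRU rule can be sketched as follows. It is a simplification: images_to_delete is a hypothetical helper, images are plain dicts, and disk usage is assumed to consist of images only, whereas kubelet measures the whole image filesystem.

```python
def images_to_delete(images, capacity, high_pct=85, low_pct=80):
    """Return image names to delete, least recently used first.

    Each image is a dict with 'name', 'size', 'last_used' and 'in_use'.
    """
    used = sum(image["size"] for image in images)
    if used * 100 <= capacity * high_pct:
        return []  # below the high threshold: nothing to do
    target = capacity * low_pct / 100
    # Images referenced by a running container are never candidates.
    candidates = sorted(
        (image for image in images if not image["in_use"]),
        key=lambda image: image["last_used"],
    )
    doomed = []
    for image in candidates:
        if used <= target:
            break
        doomed.append(image["name"])
        used -= image["size"]
    return doomed


images = [
    {"name": "app:v3", "size": 40, "last_used": 10, "in_use": True},
    {"name": "app:v2", "size": 30, "last_used": 5, "in_use": False},
    {"name": "app:v1", "size": 20, "last_used": 1, "in_use": False},
]
print(images_to_delete(images, capacity=100))  # ['app:v1']
```

Here usage is 90% of capacity, so GC starts, removes the oldest unused image, lands at 70% and stops without touching app:v2.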

For logs: kubelet rotates stdout/stderr through CRI (--container-log-max-size, default 10Mi; --container-log-max-files, default 5).

Node-pressure eviction

When the node's resources run low, kubelet proactively evicts pods, roughly in QoS order:

Signal               Default threshold            What gets evicted
memory.available     < 100Mi                      BestEffort → Burstable → Guaranteed
nodefs.available     < 10%                        same, by QoS
nodefs.inodesFree    < 5%                         same
imagefs.available    < 15%                        image GC first, then eviction
pid.available        no default (set explicitly)  same

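With soft eviction (--eviction-soft), the threshold has to stay exceeded for --eviction-soft-grace-period before kubelet acts, and the evicted pod then gets a graceful shutdown capped by --eviction-max-pod-grace-period. Hard eviction skips both: the pod is killed immediately, with no grace period.

This is separate from the kernel's OOM killer: the kernel kills a process the moment memory cannot be reclaimed, while kubelet polls the signals periodically and evicts whole pods before the node gets that far.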
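
A threshold such as 100Mi or 10% has to be turned into an absolute number before it can be compared with a signal. The sketch below shows that conversion plus a QoS-only ordering of victims. All function names are hypothetical, and the ordering is a simplification: the real kubelet also weighs pod priority and how far usage exceeds requests, and the tie-break by usage inside a class is an assumption of this example.

```python
UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
QOS_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


def threshold_value(threshold, capacity):
    """Convert '100Mi' or '10%' into an absolute value."""
    if threshold.endswith("%"):
        return capacity * float(threshold[:-1]) / 100
    for suffix, factor in UNITS.items():
        if threshold.endswith(suffix):
            return float(threshold[: -len(suffix)]) * factor
    return float(threshold)


def under_pressure(available, capacity, threshold):
    """True when the signal has dropped below its eviction threshold."""
    return available < threshold_value(threshold, capacity)


def eviction_order(pods):
    """pods: (name, qos, usage) tuples. Lowest QoS first, then largest usage."""
    ranked = sorted(pods, key=lambda pod: (QOS_ORDER[pod[1]], -pod[2]))
    return [name for name, _qos, _usage in ranked]


print(under_pressure(80 * 1024**2, 8 * 1024**3, "100Mi"))  # True
print(under_pressure(20e9, 100e9, "10%"))                   # False
pods = [
    ("db", "Guaranteed", 500),
    ("batch", "BestEffort", 100),
    ("web", "Burstable", 300),
    ("cache", "Burstable", 700),
]
print(eviction_order(pods))  # ['batch', 'cache', 'web', 'db']
```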

kubelet config

A config file (passed with --config; kubeadm generates it from the kubelet-config ConfigMap) replaces most CLI flags:

yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd                  # must match containerd
containerRuntimeEndpoint: unix:///var/run/containerd/containerd.sock
clusterDNS: [10.96.0.10]
clusterDomain: cluster.local
podCIDR: 10.244.1.0/24                  # standalone mode only; in a cluster it comes from the node spec
evictionHard:
  memory.available: "200Mi"
  nodefs.available: "10%"
systemReserved:
  cpu: "500m"
  memory: "1Gi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
maxPods: 110
serializeImagePulls: false              # pull in parallel

cgroupDriver is a frequent pain point: systemd (what kubeadm sets by default) vs cgroupfs. It must match the containerd config, otherwise kubelet and the runtime build different cgroup trees, kubelet sees pods as "not its own", and chaos ensues.
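
The reservations in the config above reduce what the scheduler can place on the node: Node Allocatable is capacity minus kubeReserved, systemReserved and the hard eviction threshold. A worked sketch, assuming a hypothetical node with 8Gi of memory and 4 CPUs (the capacities are not from this article; the reservations are the ones in the config):

```python
def allocatable(capacity, kube_reserved, system_reserved, eviction_hard=0):
    """Allocatable = capacity - kubeReserved - systemReserved - evictionHard."""
    return max(capacity - kube_reserved - system_reserved - eviction_hard, 0)


GI = 1024  # memory is expressed in Mi, so 1Gi = 1024

# Memory: 8Gi capacity, 1Gi + 1Gi reserved, 200Mi hard eviction threshold.
memory_mi = allocatable(8 * GI, 1 * GI, 1 * GI, 200)
# CPU in millicores: 4 CPUs, 500m + 500m reserved, no CPU eviction threshold.
cpu_m = allocatable(4000, 500, 500)

print(memory_mi, cpu_m)  # 5944 3000
```

So on this assumed node, roughly a quarter of both memory and CPU never reaches pods, which is worth keeping in mind when sizing small nodes.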

Static pods

These are pods that kubelet runs independently of the API server, by reading YAML from /etc/kubernetes/manifests/. They are used for the control plane (kube-apiserver, etcd, controller-manager, scheduler, each of them a static pod on the control plane node).

In the API they appear as mirror pods, but kubelet ignores edits made to them there: the source of truth is the file.

Useful: for a self-bootstrapping control plane (like kubeadm), and for standalone agents.

CSI / CNI / Device Plugin

kubelet calls out to three externals:

  • CSI over gRPC (socket at the well-known path /var/lib/kubelet/plugins/<driver>/csi.sock)
  • CNI by exec of a binary (/opt/cni/bin/<plugin>)
  • Device Plugin over gRPC (for GPU, FPGA, RDMA; a DaemonSet registers with kubelet, and kubelet passes it through to the pod)

When things go wrong

  • Node not ready right after start: kubelet could not reach the CRI socket, or CNI is not configured (/etc/cni/net.d/ is empty). Check journalctl -u kubelet.
  • failed to get system container stats: failed to get cgroup stats: a cgroup driver mismatch (systemd vs cgroupfs) between kubelet and containerd.
  • Pod Evicted with no obvious reason: run kubectl describe node and check Conditions: (MemoryPressure/DiskPressure/PIDPressure). Most often it is a full disk from the image cache. Run crictl rmi --prune.
  • exec format error in the pause container: the image architecture does not match the node. Compare the image shown in kubectl describe pod with Architecture in kubectl describe node.
  • Pod stuck in Terminating: a finalizer is holding it, or CRI is not answering StopPodSandbox. Run crictl ps to see whether the container exists, then crictl logs <id>.
  • kubelet itself gets OOM-killed: too many pods and large PodSpecs held in RAM. Lower --max-pods and review --kube-api-qps / --kube-api-burst.
  • CSI mount hangs: the csi-node-plugin pod has crashed. Check kubectl get pods -n kube-system | grep csi.
  • kubelet log shows syncPod errored: frequent and generic. It helps to use --v=4 for verbose output, but the logs balloon.

Useful diagnostic artifacts

  • journalctl -u kubelet -f, the main logs
  • crictl ps (instead of docker ps for CRI runtimes)
  • crictl logs <id>, container logs
  • crictl images, what is pulled on the node
  • cat /var/lib/kubelet/config.yaml, the current config
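  • curl -k https://localhost:10250/metrics, Prometheus metrics (needs a bearer token or client certificate unless anonymous auth is enabled)
  • curl -k https://localhost:10250/healthz, health check (same authentication applies)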

§ commands

bash
systemctl status kubelet

Daemon status, alive or dead, and the latest errors in the journal

bash
journalctl -u kubelet -f --since '5 min ago'

Stream of kubelet logs, the first place to look when pods on the node have trouble

bash
crictl ps -a

All containers through CRI, the alternative to docker ps when the runtime is containerd/CRI-O

bash
crictl logs <container-id>

Container stdout/stderr straight through CRI, without the k8s API

bash
crictl rmi --prune

Remove unused images, a quick way to free disk before eviction

bash
kubectl describe node <node-name>

Conditions, allocations, events, the first place to look when a node is unhealthy

bash
ls /etc/kubernetes/manifests/

Static pods, the control plane on kubeadm clusters

§ see also

  • kubernetes-pod-lifecycle: Kubernetes pod lifecycle: from Pending to Terminated
    A pod moves through phases Pending, Running, Succeeded/Failed/Unknown. Init containers run sequentially before the main ones. Probes: startup, then readiness/liveness. SIGTERM plus a grace period on delete.
  • runc-and-runsc: runc, runsc, kata: container runtimes
    runc is the standard OCI runtime: namespaces+cgroups+seccomp. runsc/gVisor is a userspace kernel for extra isolation. kata is a lightweight VM per container. Performance and isolation trade off against each other.
  • cgroups-v2-deep: cgroups v2: unified hierarchy, PSI, eBPF control
    cgroups v2 uses one tree instead of separate per-controller hierarchies. Clean semantics, new fields (memory.high, io.cost). PSI shows resource pressure. eBPF can manage resources. Default in RHEL 9, Ubuntu 22+.