Troubleshooting scenarios

Scenario questions are the most valuable in an interview. Not 'what is load average' but 'the on-call engineer calls, the server is slow, what do you do first'. These questions separate the person who read the textbook from the person who has put out incidents by hand. The scenarios are real cases from SRE interviews at Cloudflare, Datadog, Yandex, Avito, and mid-size Russian infrastructure teams.

8 questions · ~30 min read

#server-slow-where-start

intermediate · often

You get a call: 'prod is slow'. What are the first 5 commands?

What to answer

`uptime` for the load average and the time since boot (did it just reboot?). `top -bn1 | head -20` for who is eating CPU right now. `vmstat 1 5` for CPU, I/O, and swap activity, sampled once a second. `dmesg --since "1 hour ago"` for OOM kills, kernel errors, a dropped disk (`--since` exists only in recent util-linux releases; on older systems use `journalctl -k --since "1 hour ago"` or `dmesg -T | tail`). `df -h && df -i` for space and inodes. That gives you a process/memory/disk picture in 30 seconds, and from there the diagnosis gets targeted.

What they want to hear

A senior should:
- name the bottom-up order: resources first (CPU, RAM, I/O, disk), then network connections, then the application
- mention the USE method (Brendan Gregg): for each resource, check Utilization, Saturation, and Errors
- say that `top` shows an instantaneous snapshot, that a trend needs Prometheus and Grafana, and that history needs `sar` (sysstat)
- name `ss -s` for a connection summary and `iostat -x 1 5` for disks
- not jump straight into `strace` or `perf`: those are second-tier tools, used once the first five commands have narrowed the area
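
One way to put a number on the Saturation part of USE for the CPU is load per core. This is a minimal sketch: the helper name `load_per_core` is an assumption made for illustration, and it only does the arithmetic on the values passed in.

```shell
# Hypothetical helper, not part of any package.
# $1: a line in /proc/loadavg format, $2: number of CPU cores.
# Prints the 1-minute load average divided by the core count.
load_per_core() {
    printf '%s\n' "$1" | LC_ALL=C awk -v cores="$2" '{ printf "%.2f\n", $1 / cores }'
}
```

On a live host it could be called as `load_per_core "$(cat /proc/loadavg)" "$(nproc)"`. A value well above 1 suggests tasks are queuing, although on Linux the load average also includes tasks blocked on I/O, not only CPU demand.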

Pitfalls

✗ Running tcpdump or strace right away. Too narrow without context.
✗ Not checking `dmesg`, which often holds the direct answer (OOM, hardware error).
✗ Forgetting `df -i`. Inode exhaustion gives 'no space' while there is free space.

Follow-up

? What does `vmstat 1 10` show, and which columns matter most?
? How does the USE method differ from the RED method?
? Where do you look for the trend over the last day if there was no Prometheus?

#high-load-low-cpu

senior · often

Load average 30, CPU usage 5%. What is going on?

What to answer

Load average counts both runnable tasks (state R, running or waiting for a CPU) and tasks in uninterruptible sleep (state D, usually waiting on I/O). If the CPU is idle while the load is high, a pile of processes is in D-state, usually because of a slow disk or a hung NFS mount. Run `ps -eo state,pid,cmd | awk '$1 ~ /^D/'` to see which ones, then `iostat -x 1` and `iotop` to find who is generating the I/O and on which device.
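
To see at a glance how the load splits between R and D, the state column of `ps` can be tallied. A rough sketch, assuming one state string per line on stdin; the function name `count_load_states` is made up for this example.

```shell
# Hypothetical helper: tally runnable (R) and uninterruptible (D) tasks.
# Reads process states from stdin, one per line, e.g. from `ps -eo state=`.
count_load_states() {
    awk '
        $1 ~ /^R/ { r++ }   # running or waiting for a CPU
        $1 ~ /^D/ { d++ }   # uninterruptible sleep, usually I/O
        END { printf "R=%d D=%d\n", r, d }
    '
}
```

Typical use would be `ps -eo state= | count_load_states`. If D is far larger than R while the CPU is idle, the I/O path is the place to look.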

What they want to hear

A candidate should:
- explain what load average is made of: R plus D, not R alone
- name D-state as the prime suspect when load average is high and CPU is low; the usual causes are disk, NFS, or iSCSI
- say that a D-state process cannot be killed even with `kill -9`: it is blocked in a syscall inside the kernel, and the signal is delivered only when that call returns
- name `iostat`, `iotop`, `/proc/<pid>/io`, and `/proc/<pid>/stack` as the tools to dig further
- mention that an NFS mount without `soft,intr` can hang for good and take a pile of processes with it (on kernels after 2.6.25 `intr` is deprecated and ignored, so `soft` with `timeo` and `retrans` is what actually matters)

Pitfalls

✗ Saying 'load average is CPU'. No, it is R plus D processes.
✗ Not recalling D-state as the cause.
✗ Assuming perf or a profiler will help. No, a profiler sees CPU, and here the problem is I/O wait.

Follow-up

? What is `iowait` in the `top` output, and how is it tied to D-state?
? How does `mount -t nfs soft` differ from `hard`?
? How do you kill a process stuck in D-state on a dropped NFS mount?

#disk-full-cant-find

intermediate · often

df says 100% full, du on the root shows half. Where did the space go?

What to answer

Open deleted files. Some process still holds an fd on a gigabyte-sized log that was `rm`'d. The kernel does not free the blocks while the fd is open, but `du` no longer sees the file because its directory entry is gone. The fix: `lsof | grep deleted` (or `lsof +L1`) finds the culprit, then make the process reopen its logs (SIGHUP for most daemons) or restart it. An alternative: `truncate -s 0 /proc/<pid>/fd/N` frees the blocks without restarting the process.
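
Where `lsof` is not installed, the same information can be read from `/proc`: each `/proc/<pid>/fd/N` is a symlink, and the kernel appends ` (deleted)` to the target of an unlinked file. A sketch under those assumptions; the function name `list_deleted_fds` is hypothetical, and the optional argument for the proc root exists only so the function can be tried against a test directory.

```shell
# Hypothetical helper: list open-but-deleted files by scanning fd symlinks.
# $1: proc root to scan (default: /proc). Prints "PID FD TARGET" per hit.
list_deleted_fds() {
    local root fd target pid
    root="${1:-/proc}"
    for fd in "$root"/[0-9]*/fd/*; do
        # Skip entries that vanished or that we may not read.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case "$target" in
            *' (deleted)')
                pid=${fd#"$root"/}   # strip the proc root prefix
                pid=${pid%%/*}       # keep only the PID component
                printf '%s %s %s\n' "$pid" "${fd##*/}" "$target"
                ;;
        esac
    done
}
```

Run as root to see other users' processes. The PID and fd number it prints are what `truncate -s 0 /proc/<pid>/fd/N` needs.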

What they want to hear

A senior should:
- name `lsof | grep deleted` as the first command; it closes most cases
- explain why truncating through `/proc/<pid>/fd/N` works: it is the same file the process holds, just without a directory entry
- name the second common case: files hidden UNDER a mount point. If a file system was mounted over a non-empty directory, the files underneath still take up space on the parent file system but are not visible to `du /` until you `umount`
- mention reserved blocks (`tune2fs -m 0` releases the default 5 percent on ext file systems) as an aggressive last resort

Pitfalls

✗ Searching for large files with `du` right away. `du` will not show them.
✗ Running `rm` again. The file is already gone, and the fd still holds the blocks.
✗ Restarting the whole service when a SIGHUP would have been enough.

Follow-up

? How does `logrotate` handle this properly, and what does `copytruncate` do?
? What happens if you run `> /proc/PID/fd/N`?
? Why does restarting the daemon free the space instantly?

#oom-killed-my-service

intermediate · often

A service died, the logs are empty. dmesg shows OOM. What next?

What to answer

`dmesg | grep -i oom` or `journalctl -k --grep=oom` shows the victim of a kernel OOM kill and whether the kill was host-wide or limited to a memory cgroup. Kills made by systemd-oomd happen in user space and show up in `journalctl -u systemd-oomd` instead. Then look at the service's `oom_score` and its cgroup limit. If the cgroup limit is below the real need, either raise the limit (`memory.max` in cgroups v2) or fix the leak. If there was no limit and the host-wide kernel OOM killer fired, look at the memory-usage graph and at the neighbors sharing the host.

What they want to hear

A senior should:
- tell apart the kernel (host-wide) OOM, where the host is out of memory, and a cgroup OOM, where a container or service went over its limit; these are two different cases
- name `memory.high` as the graceful backpressure mechanism in cgroups v2 (throttling and reclaim), as opposed to the hard `memory.max`
- say that `oom_score_adj=-1000` makes a process immune to the OOM killer; it is used for critical components (sshd, for example)
- mention that swap does NOT save you from OOM in the long run, it only delays it; thrashing is usually worse than an honest OOM
- note that in Kubernetes an OOM in a container restarts the container with reason `OOMKilled`, visible through `kubectl describe pod`
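
For the cgroup case, cgroups v2 keeps per-cgroup counters in a `memory.events` file, including an `oom_kill` line. A small sketch that reads that counter; the helper name `cgroup_oom_kills` and the path in the usage note are assumptions for illustration.

```shell
# Hypothetical helper: print the oom_kill counter from a cgroup v2
# memory.events file. $1: path to the memory.events file.
cgroup_oom_kills() {
    awk '$1 == "oom_kill" { print $2 }' "$1"
}
```

A call might look like `cgroup_oom_kills /sys/fs/cgroup/system.slice/myapp.service/memory.events`, where `myapp.service` is a placeholder. A non-zero value there while the host had free memory points to the limit rather than to the host.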

Pitfalls

✗ Thinking OOM always means the host is out of memory. It can be a cgroup limit while the host has memory free.
✗ Setting `oom_score_adj=-1000` everywhere. The system is left with no way to recover itself.
✗ Turning on swap as a 'fix'. It usually gives degradation instead of OOM.

Follow-up

? How does `systemd-oomd` differ from the kernel OOM?
? What does `memory.events` show in cgroup v2?
? How does `earlyoom` fire before the kernel OOM, and why do you need it?

#dns-broken-debug

intermediate · often

curl to example.com fails, ping works. What next?

What to answer

Ping works by IP, so there is network connectivity. curl by name fails, so name resolution is the suspect. Check: `dig example.com` (queries the nameserver from `/etc/resolv.conf` directly, bypassing NSS and `/etc/hosts`), then `cat /etc/resolv.conf`, then `getent hosts example.com` (goes through NSS, the way curl does). A mismatch between `dig` and `getent` points to where it is broken: NSS, an `/etc/hosts` override, or systemd-resolved.

What they want to hear

A senior should:
- tell apart curl's path (NSS, then systemd-resolved, then upstream DNS) and dig's path (straight to the nameserver in `/etc/resolv.conf`, or to the one given with `@server`)
- note that ping also resolves through NSS, but when the user passes an IP no resolution is needed; 'ping by IP works' in this problem means the network is OK
- check the `hosts:` line in `/etc/nsswitch.conf`; the order matters, and `files` listed first means `/etc/hosts` overrides DNS
- mention that a Docker container has its own `/etc/resolv.conf` (from the Docker DNS), and in k8s it points to CoreDNS, so the resolution path is different
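
The lookup order that `getent` and curl follow can be read straight from the `hosts:` line. A minimal sketch; `nss_hosts_order` is a hypothetical name, and the optional file argument is there only so it can be pointed at a copy of the file.

```shell
# Hypothetical helper: print the lookup sources from the hosts: line
# of an nsswitch.conf-style file (default: /etc/nsswitch.conf).
nss_hosts_order() {
    awk '$1 == "hosts:" { $1 = ""; sub(/^[ \t]+/, ""); print; exit }' \
        "${1:-/etc/nsswitch.conf}"
}
```

If the output starts with `files`, an entry in `/etc/hosts` wins over DNS, which is one way `getent hosts` and `dig` end up disagreeing.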

Pitfalls

✗ Running `dig @8.8.8.8 example.com` right away. It bypasses your resolver and will not show the problem.
✗ Thinking `nslookup` is an adequate tool. It is dated and hides half the information; use `dig`.
✗ Not checking `/etc/hosts`. A plain override is often the cause.

Follow-up

? How does `getent hosts X` differ from `dig X`? When do they give different answers?
? How do you view the systemd-resolved cache and clear it?
? What happens with `search example.com.` in `/etc/resolv.conf`?

#tcp-connection-hangs

senior · often

A client connects to the service, there is no reply, the connection hangs. What do you look for?

What to answer

There are three places where it can break. `ss -tn state established | grep <port>`: if the connection is in ESTABLISHED, the problem is in the application (it accepted the connection but does not answer). If not, `tcpdump -i any port N` shows whether the SYN left with no SYN-ACK (a firewall drop or the server is not listening), or the SYN-ACK arrived with no final ACK (a drop on the path, asymmetric routing). If the handshake completes and the connection stalls on the first large packet, suspect an MTU blackhole. Alongside that, check `dmesg` and `conntrack -L`: the conntrack table may be full.

What they want to hear

A senior should:
- name `ss` as the first tool and `tcpdump` as the second
- tell apart three classes of cause: network (the SYN does not arrive), firewall (a DROP rule or conntrack overflow), and application (the connection exists, but the handler is stuck)
- mention `nf_conntrack_max` and `nf_conntrack_count` as the overflow check, a classic SRE incident under load
- name the MTU blackhole as a rare but nasty cause: the handshake goes through, then the first large packet is dropped and no ICMP 'fragmentation needed' comes back (a firewall drops ICMP)
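
The conntrack overflow check comes down to comparing the two counters. A sketch of that arithmetic; the helper name `conntrack_usage_pct` is an assumption, and the values are passed in as arguments rather than read inside the function.

```shell
# Hypothetical helper: conntrack table fill level as an integer percent.
# $1: current entry count, $2: table maximum.
conntrack_usage_pct() {
    if [ "$2" -gt 0 ]; then
        echo $(( $1 * 100 / $2 ))
    else
        echo "max must be greater than zero" >&2
        return 1
    fi
}
```

On a host with the conntrack module loaded, the inputs would come from `/proc/sys/net/netfilter/nf_conntrack_count` and `/proc/sys/net/netfilter/nf_conntrack_max`. A value close to 100 means new connections are at risk of being dropped.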

Pitfalls

✗ Running an unfiltered tcpdump right away. It fills the disk; you need filters.
✗ Not checking conntrack, which is typically ignored.
✗ Saying 'raise the timeout'. That treats the symptom, not the cause.

Follow-up

? How do you tell a firewall dropping the SYN apart from the server not listening, in tcpdump?
? What does `ss -i -t` show (with TCP-state info)?
? How do you detect an MTU blackhole with tracepath?

Deeper in the knowledge base

TCP three-way handshake
TCP states (LISTEN, ESTABLISHED, TIME_WAIT)
conntrack
MTU and PMTUD

#container-wont-start

intermediate · often

A Docker container crashes right after start, exit code 137. What is it?

What to answer

137 = 128 + 9, so the process was killed by SIGKILL. Usually that is an OOM kill: the container exceeded the cgroup memory limit (`memory.max`) that Docker or the kubelet set for it. Check: `docker inspect <id>`, where `OOMKilled: true` together with `ExitCode: 137` confirms OOM. `dmesg` shows the kernel side, and `kubectl describe pod` (in k8s) shows that the limit fired. The fix: raise the memory limit, or fix the leak in the application.

What they want to hear

A candidate should:
- decode the exit codes: 128+N means killed by signal N (130 = SIGINT/Ctrl+C, 137 = SIGKILL, 139 = SIGSEGV, 143 = SIGTERM)
- name `docker inspect` and `kubectl describe pod` as the source of truth for the exit reason
- explain the difference between a host OOM kill and a cgroup OOM; in Docker and k8s it is usually the latter
- mention probes as another cause of restarts: a failing liveness (or startup) probe makes the kubelet restart the container, while a failing readiness probe only removes the pod from Service endpoints
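
The 128+N rule is simple enough to script. A sketch with a hypothetical function name `decode_exit`; it names only the four signals listed above and falls back to the bare number for anything else.

```shell
# Hypothetical helper: explain a container or shell exit code.
# $1: numeric exit code (0-255).
decode_exit() {
    local code sig name
    code=$1
    if [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        case "$sig" in
            2)  name=SIGINT ;;
            9)  name=SIGKILL ;;
            11) name=SIGSEGV ;;
            15) name=SIGTERM ;;
            *)  name="signal $sig" ;;
        esac
        echo "killed by $name (128 + $sig)"
    else
        echo "exited with status $code"
    fi
}
```

For example, `decode_exit 137` reports SIGKILL. The code alone does not say who sent the signal, so `OOMKilled` in `docker inspect` is still needed to tell an OOM kill from a manual `docker kill`.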

Pitfalls

✗ Mixing up exit code 137 (SIGKILL) with 139 (SEGV). Different causes.
✗ Thinking 137 always means OOM. It can be a plain `docker kill`.
✗ Not checking the probes, a typical cause of 'the pod restarts every 30 seconds'.

Follow-up

? What does exit code 143 mean, and when do you see it?
? How does `restartPolicy: OnFailure` differ from `Always` in k8s?
? How do you set a sensible memory limit when the application keeps crashing?

#cron-doesnt-fire

junior · sometimes

A cron job does not run. PATH, env, what do you check?

What to answer

Cron runs with a minimal environment: PATH is usually `/usr/bin:/bin`, your `.bashrc` is not read, and HOME can be `/`. A script that works in the shell but fails under cron is almost always hitting this. Check: `grep CRON /var/log/syslog` (or `journalctl -u cron`; the unit is `crond` on RHEL-family systems) shows whether it ran. If it ran and failed, add `>>/tmp/cron.log 2>&1` to the crontab line to capture stdout and stderr.

What they want to hear

A senior should:
- name the minimal PATH and the missing dotfiles as the main cause of 'works in the shell, fails in cron'
- say that cron allocates no terminal, so `tput`, `read`, and `sudo -A` break
- mention `MAILTO` in the crontab, which is where the job's output goes by default (if there is a local MTA)
- name systemd timers as the modern alternative with far more control (env, restart, dependencies)
- suggest `env -i bash -c '<command>'` as a way to reproduce the cron environment in the shell
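
The `env -i` trick can be wrapped so it also sets the few variables cron usually provides. This is a rough approximation, not an exact copy of any cron implementation: the function name `run_like_cron` is hypothetical, and the PATH value is the one quoted above.

```shell
# Hypothetical helper: run a command line roughly the way cron would,
# with an empty environment plus HOME, PATH and SHELL.
# $1: the command line, as it would appear in the crontab.
run_like_cron() {
    env -i HOME="${HOME:-/}" PATH=/usr/bin:/bin SHELL=/bin/sh /bin/sh -c "$1"
}
```

Something like `run_like_cron '/opt/scripts/backup.sh'` (a placeholder path) will usually reproduce a 'command not found' that only shows up under cron. It uses `/bin/sh` rather than `bash` because that is the default shell for cron jobs.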

Pitfalls

✗ Thinking cron inherits the user's `.bashrc`. No: cron runs the command with a non-interactive, non-login `/bin/sh`, which reads no dotfiles.
✗ Using `~` in paths. HOME may not be what you expect.
✗ Not redirecting stderr. The error messages are lost.

Follow-up

? Why do you put `MAILTO=""` at the top of a crontab?
? Why is a systemd timer better than cron for production?
? How do you keep a cron job from running in parallel with the previous instance?
