#server-slow-where-start
You get a call: 'prod is slow'. What are the first 5 commands?
What to answer
`uptime` for the load average and time since boot (did it just reboot?). `top -bn1 | head -20` for who is eating CPU right now. `vmstat 1 5` for CPU, IO, and swap, sampled once a second. `dmesg --since "1 hour ago"` for OOM kills, kernel errors, a dropped disk. `df -h && df -i` for space and inodes. That gives you a process/memory/disk picture in 30 seconds, and from there the diagnosis gets targeted.
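The five commands can be wrapped in a small script so that a missing tool or a permission error does not stop the pass. This is a minimal sketch, not a standard tool: the helper names `run` and `triage` are made up for illustration, and the `dmesg --since` option is assumed to be available (it needs a newer util-linux; older versions will just report an error here).

```sh
#!/bin/sh
# First-pass triage sketch. Helper names (run, triage) are illustrative.

# Print a header, then run the command; keep going if it is missing or fails.
run() {
    printf '\n== %s ==\n' "$*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "(failed: may need root or a newer version)"
    else
        echo "(not installed: $1)"
    fi
}

triage() {
    run uptime                        # load average, time since boot
    run top -bn1 | head -n 22         # blank line, header, first 20 lines of top
    run vmstat 1 5                    # CPU, IO, swap: five one-second samples
    run dmesg --since "1 hour ago"    # assumption: --since is supported here
    run df -h                         # disk space
    run df -i                         # inodes
}

triage
```

Redirecting the output to a file (for example `sh triage.sh > triage.txt`, with a hypothetical file name) keeps a record of the state at the time of the call.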
What they want to hear
A senior should:
- name the bottom-up order: resources first (CPU, RAM, IO, disk), then network connections, then the application
- mention the USE method (Brendan Gregg): for each resource, check Utilization, Saturation, Errors
- say that `top` shows an instant snapshot, that a trend needs Prometheus and Grafana, and that the past needs `sar` (sysstat)
- name `ss -s` for a connection summary and `iostat -x 1 5` for disks
- not jump straight into `strace` or `perf`: those are second-tier tools, used once the first five commands have narrowed the area
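As one concrete instance of the Saturation check in USE, the load average can be compared with the number of CPUs. The sketch below assumes Linux (`/proc/loadavg`), and `load_per_cpu` is a hypothetical helper name. Treat the result as a hint only: on Linux the load average also counts tasks in uninterruptible sleep, so a high value can mean IO waits as well as CPU queueing.

```sh
#!/bin/sh
# CPU saturation hint for the USE method. load_per_cpu is an illustrative helper.

# $1 = load average, $2 = number of online CPUs
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

load1=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average (Linux only)
cpus=$(getconf _NPROCESSORS_ONLN)        # online CPU count
echo "1-min load per CPU: $(load_per_cpu "$load1" "$cpus")"
```

A value that stays well above 1.00 suggests tasks are queueing, which is the point where `vmstat` and `iostat -x` help tell CPU pressure from disk pressure.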
Pitfalls
- ✗ Running tcpdump or strace right away. Too narrow without context.
- ✗ Not checking `dmesg`, which often holds the direct answer (OOM, hardware error).
- ✗ Forgetting `df -i`. Inode exhaustion gives 'No space left on device' while `df -h` still shows free space.
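The inode pitfall is easy to script around. The sketch below filters `df -iP` output for filesystems at or above a threshold. The `inode_alert` name and the 90% default are assumptions for illustration, and the parsing assumes mount points without spaces.

```sh
#!/bin/sh
# Flag filesystems that are close to inode exhaustion.
# inode_alert is an illustrative helper; 90 is an arbitrary default threshold.

# stdin: output of `df -iP`; $1: threshold in percent
inode_alert() {
    awk -v limit="${1:-90}" 'NR > 1 && $5 != "-" {
        pct = $5
        sub(/%/, "", pct)                   # "95%" -> "95"
        if (pct + 0 >= limit) print $6, $5  # mount point, inode usage
    }'
}

df -iP | inode_alert 90
```

No output means no filesystem crossed the threshold. Filesystems that do not report inode counts show `-` in the usage column and are skipped.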
Follow-up
- ? What does `vmstat 1 10` show, and which columns matter most?
- ? How does the USE method differ from the RED method?
- ? Where do you look for the trend over the last day if there was no Prometheus?
Deeper in the knowledge base