Networking: TCP/IP, routing, firewall

Networking questions form a layer of their own. Even strong Linux engineers slip here if they have not worked closely with networks. TCP states, MTU, NAT, routing, and iptables/nftables are must-know topics for SRE, DevOps, and Platform roles. The questions are drawn from interviews at Cloudflare, Datadog, HashiCorp, and Russian infrastructure teams.

8 questions · ~30 min read

#tcp-three-way-handshake

intermediate · often

Describe the TCP three-way handshake. What is in each packet?

What to answer

The client sends a SYN carrying its ISN (initial sequence number). The server replies with a SYN-ACK: it acknowledges the client's ISN by setting ack = client ISN + 1 and sends its own ISN. The client replies with an ACK whose ack = server ISN + 1. After three packets both sides know each other's starting sequence numbers, and the connection is ESTABLISHED.
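
The sequence and acknowledgment numbers of the three segments can be traced with a minimal sketch. The function name and the tuple layout are hypothetical, and a random number stands in for the kernel's ISN generator.

```python
import secrets

MOD = 2**32  # sequence numbers are 32-bit and wrap around

def three_way_handshake(client_isn=None, server_isn=None):
    """Return the (flags, seq, ack) fields of the three handshake segments."""
    # Assumption: a random value stands in for the kernel's ISN generator.
    if client_isn is None:
        client_isn = secrets.randbelow(MOD)
    if server_isn is None:
        server_isn = secrets.randbelow(MOD)
    syn = ("SYN", client_isn, None)
    # The SYN consumes one sequence number, hence the +1 in the acknowledgment.
    syn_ack = ("SYN-ACK", server_isn, (client_isn + 1) % MOD)
    ack = ("ACK", (client_isn + 1) % MOD, (server_isn + 1) % MOD)
    return [syn, syn_ack, ack]

for flags, seq, ack in three_way_handshake(client_isn=1000, server_isn=5000):
    print(f"{flags:8} seq={seq} ack={ack}")
```

With fixed ISNs of 1000 and 5000 the SYN-ACK acknowledges 1001 and the final ACK acknowledges 5001.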

What they want to hear

A senior should say:
- the ISN is chosen at random (not zero) to avoid predictability and hijacking attacks
- in the SYN packet the client announces its options: MSS, window scale, SACK permitted, timestamps
- the server's SYN-ACK carries its options too; the final set is the intersection (window scale, SACK, and timestamps are used only if both sides offer them)
- after the handshake comes slow start: cwnd grows exponentially from the initial congestion window (10 segments on modern Linux) up to ssthresh
- if a SYN is lost, it is retried with exponential backoff (1s, 2s, 4s, and so on), controlled by `net.ipv4.tcp_syn_retries`
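
The SYN retry timing mentioned above can be worked out with a short sketch. It assumes an initial retransmission timeout of 1 second that doubles after every attempt and 6 retries (the usual Linux default for `net.ipv4.tcp_syn_retries`). The function name is hypothetical.

```python
def syn_retry_schedule(retries=6, initial_rto=1.0):
    """Return (retransmit_times, give_up_time) in seconds after the first SYN."""
    retransmit_times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto            # wait one timeout, then retransmit the SYN
        retransmit_times.append(elapsed)
        rto *= 2                  # exponential backoff
    give_up_time = elapsed + rto  # the last retransmit also waits a full timeout
    return retransmit_times, give_up_time

times, give_up = syn_retry_schedule()
print(times)    # [1.0, 3.0, 7.0, 15.0, 31.0, 63.0]
print(give_up)  # 127.0
```

Under these assumptions a `connect()` to a host that silently drops SYNs fails only after about two minutes, which is why the retry count is a common tuning knob.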

Pitfalls

✗ Saying the handshake is 3 RTT. No, it is 1 RTT: three packets, but the client can already send data along with the third ACK. TCP Fast Open goes further and carries data in the SYN itself.
✗ Forgetting the options in the SYN, which matter for understanding tuning.
✗ Confusing SYN-flood protection with SYN cookies. SYN cookies build the server ISN from a hash of the 4-tuple plus a secret, so there is no half-open state to hold.
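
The SYN-cookie idea can be sketched in a few lines. This is a simplified illustration, not the kernel's algorithm: real cookies also pack an MSS hint and a time counter into the 32 bits, and HMAC-SHA256 here is only a stand-in for the kernel's keyed hash. All names and addresses are hypothetical.

```python
import hashlib
import hmac
import socket
import struct

def syn_cookie(secret, src_ip, src_port, dst_ip, dst_port, counter):
    """Derive a 32-bit server ISN from the 4-tuple, a secret, and a time counter."""
    msg = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
           + struct.pack("!HHI", src_port, dst_port, counter))
    digest = hmac.new(secret, msg, hashlib.sha256).digest()
    return struct.unpack("!I", digest[:4])[0]

def ack_is_valid(secret, src_ip, src_port, dst_ip, dst_port, counter, ack_number):
    """Validate the final ACK with no stored state: recompute and compare."""
    cookie = syn_cookie(secret, src_ip, src_port, dst_ip, dst_port, counter)
    return ack_number == (cookie + 1) % 2**32

secret = b"example-secret"  # hypothetical; a real kernel uses random keys
isn = syn_cookie(secret, "203.0.113.7", 51514, "198.51.100.1", 443, counter=42)
print(ack_is_valid(secret, "203.0.113.7", 51514, "198.51.100.1", 443, 42, (isn + 1) % 2**32))
```

The server keeps nothing between the SYN-ACK and the ACK: everything it needs to accept the connection is recomputed from the packet and the secret.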

Follow-up

? What is TCP Fast Open, and why did it never see wide adoption?
? How do SYN cookies work, and why are they turned on only under attack rather than by default?
? What does a SYN-ACK carry beyond the ISN acknowledgment?

Deep dive in the knowledge base

#time-wait

senior · often

What is TIME_WAIT? Why does it last 60 seconds and who does it annoy?

What to answer

TIME_WAIT is the socket state on the side that initiated the close, entered once it has received the peer's FIN and sent the final ACK. It lasts `2*MSL` (Maximum Segment Lifetime), which is 60s on Linux. It keeps late segments from landing in a new connection with the same `(src_ip, src_port, dst_ip, dst_port)`, and it covers the case where the final ACK is lost: the peer retransmits its FIN, and the socket is still there to resend the ACK.

What they want to hear

A senior should:
- explain that whoever initiates the close is the one that ends up in TIME_WAIT; load balancers often close the connections to their backends, so the TIME_WAIT sockets pile up on the LB
- name the tuning: `tcp_tw_reuse=1` for outbound connections, `SO_REUSEADDR` to bind a LISTEN socket after a restart, `ip_local_port_range` to widen the ephemeral port pool
- say that `tcp_tw_recycle` was removed in Linux 4.12, since it relied on timestamps and broke clients behind NAT
- name the symptom: `EADDRINUSE` when you try to bind the same port, or running out of ephemeral ports on the client side (`EADDRNOTAVAIL` on `connect()`)
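
A rough calculation shows why ephemeral ports run out. The sketch assumes the common Linux default `ip_local_port_range` of 32768 to 60999, a 60-second TIME_WAIT, no `tcp_tw_reuse`, and every connection going from one source IP to one `(dst_ip, dst_port)`. The function name is hypothetical.

```python
def max_new_connections_per_second(port_low=32768, port_high=60999, time_wait_s=60):
    """Upper bound on new outbound connections per second to ONE destination
    from ONE source IP when every closed connection lingers in TIME_WAIT."""
    ports = port_high - port_low + 1
    return ports / time_wait_s

print(round(max_new_connections_per_second()))             # 471
print(round(max_new_connections_per_second(1024, 65535)))  # 1075
```

Widening the port range only doubles the ceiling, which is why connection reuse (keep-alive, pooling) is usually the more effective fix.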

Pitfalls

✗ Saying MSL is always 30s. It differs across operating systems, and on Linux the 60s for 2*MSL is hardcoded.
✗ Thinking `SO_REUSEADDR` fixes the client-side problem. No, it is about bind() for LISTEN, not about outbound connections.
✗ Recommending `tcp_tw_recycle`. It is removed and dangerous.

Follow-up

? At exactly what point does a socket move into TIME_WAIT, after the first FIN or after the final ACK?
? Why does `tcp_tw_recycle` break NAT, and what goes wrong with timestamps?
? How do Kubernetes and an Istio mesh suffer from TIME_WAIT, and what do you do about it?

Deep dive in the knowledge base

#mtu-pmtud

intermediate · often

What is the MTU? What happens if an IP packet is larger than the MTU?

What to answer

The MTU (Maximum Transmission Unit) is the largest IP packet an interface will send in one frame. On Ethernet it is 1500 bytes. If an IPv4 packet is larger, it is either fragmented (by the sender, or by a router along the path when the DF bit is not set) or, with DF set, dropped by the router, which answers with an ICMP `Fragmentation needed`. In IPv6 only the source fragments, and intermediate nodes do not.

What they want to hear

A senior should:
- explain PMTUD (Path MTU Discovery): the sender sets the DF bit, gets an ICMP `Fragmentation needed` back, and lowers its path MTU estimate; it breaks when a firewall blocks ICMP, which is the 'PMTUD black hole'
- name TCP MSS = MTU - 40 for IPv4 (20-byte IP header plus 20-byte TCP header, no options); for TCP there is a separate MSS clamping mechanism in the firewall
- name the typical case: a VPN, VXLAN, or GRE tunnel adds tens of bytes of overhead (about 50 for VXLAN), so packets inside the tunnel must be smaller, otherwise you get fragmentation or a black hole
- mention jumbo frames (MTU 9000) in data centers for efficiency
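
The header arithmetic behind these numbers is small enough to write down. The sketch assumes IPv4 with no IP or TCP options and VXLAN over an IPv4 underlay. The function names are hypothetical.

```python
IPV4_HEADER = 20     # bytes, without IP options
TCP_HEADER = 20      # bytes, without TCP options
ICMP_HEADER = 8      # bytes, ICMP echo header
VXLAN_OVERHEAD = 50  # outer Ethernet 14 + outer IPv4 20 + UDP 8 + VXLAN 8

def tcp_mss(mtu):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER

def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits without fragmentation."""
    return mtu - IPV4_HEADER - ICMP_HEADER

print(tcp_mss(1500))                   # 1460
print(max_ping_payload(1500))          # 1472
print(1500 - VXLAN_OVERHEAD)           # 1450
print(tcp_mss(1500 - VXLAN_OVERHEAD))  # 1410
```

The 1472 figure is the payload size that exactly fills a 1500-byte MTU in a ping test, and 1450 is the MTU left for the overlay once VXLAN headers are accounted for.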

Pitfalls

✗ Saying the MTU is always 1500. That is for Ethernet; on loopback it is 65536, on a VXLAN overlay 1450.
✗ Forgetting the DF bit and its role in PMTUD.
✗ Not mentioning MSS clamping, the typical fix for VPN problems.

Follow-up

? What is a PMTUD black hole, and how do you deal with it?
? How does `ping -M do -s 1472 <host>` test the MTU, and where does 1472 come from?
? What is the difference between MTU and MSS? Who announces each, and when?

Deep dive in the knowledge base

[[mtu-and-pmtud]]
Ethernet Frame
[[vxlan-overlay]]

#dns-resolution-path

intermediate · often

What happens when I type `curl google.com`? Where and how does the name resolve?

What to answer

curl calls `getaddrinfo("google.com")` through glibc or musl. glibc consults NSS (`nsswitch.conf`), usually `/etc/hosts` first, then DNS. The DNS query goes to the nameserver listed in `/etc/resolv.conf` (on Ubuntu that is systemd-resolved on 127.0.0.53, which forwards to the real DNS servers). The recursive resolver then walks the DNS hierarchy: root, then TLD, then authoritative.

What they want to hear

A senior should name:
- `getaddrinfo` (not the obsolete `gethostbyname`) and the fact that NSS is a plugin architecture (LDAP, mDNS, sssd can sit in front of DNS)
- the difference between a stub resolver (in libc or systemd-resolved) and a recursive resolver (8.8.8.8, which caches answers)
- the `/etc/nsswitch.conf` line `hosts: files dns`; the order matters, and `files` means `/etc/hosts`
- `dig` as the right CLI tool (versus the older `nslookup`)
- that systemd-resolved on 127.0.0.53 is the default on Ubuntu but not everywhere; on stock RHEL or Debian `/etc/resolv.conf` usually points straight at the upstream DNS servers
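
The same resolution path can be exercised from Python, whose `socket.getaddrinfo` on Linux goes through the C library and therefore follows `nsswitch.conf` and `/etc/hosts`. This is a minimal sketch; the helper name `resolve` is hypothetical.

```python
import socket

def resolve(host, port=443):
    """Resolve a name the way most programs do: through the libc getaddrinfo()."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Calling `resolve("google.com")` needs working DNS. With `hosts: files dns`, adding the name to `/etc/hosts` changes the answer without any DNS query leaving the machine.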

Pitfalls

✗ Saying curl goes straight to DNS. No, it goes through NSS, and with `files` listed first `/etc/hosts` overrides DNS.
✗ Forgetting about caching layers: glibc itself does not cache DNS answers, so caching comes from systemd-resolved or from nscd, which is long deprecated but still found on legacy systems.
✗ Not mentioning that systemd-resolved listens on 127.0.0.53, so `dig` against `127.0.0.1` will not reach it.

Follow-up

? How does a stub resolver differ from a recursive resolver?
? What is `nsswitch.conf`, and why is `files dns` the default rule?
? How do you view the systemd-resolved cache, and how do you clear it?

Deep dive in the knowledge base

#iptables-vs-nftables

senior · sometimes

How does nftables differ from iptables? Why the replacement?

What to answer

nftables (in the kernel since Linux 3.13) is the modern replacement for iptables, ip6tables, arptables, and ebtables. It offers one syntax, one kernel subsystem (`nf_tables`), atomic updates over netlink (rather than reloading whole tables through iptables-restore), and native support for sets and maps. On Ubuntu 22.04+ and Debian 10+ it is the default.

What they want to hear

A senior should:
- explain that iptables 'stayed' as the `iptables-nft` wrapper: the same commands, but nf_tables underneath
- name the advantages: one `nft` command for v4, v6, arp, and bridge; atomic rule sets (the whole set applies or nothing does); sets for matching large IP lists efficiently
- mention the netfilter hooks: PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING, the points in the packet pipeline shared by both tools
- say that nftables compiles rules into bytecode for a small virtual machine inside the kernel; the design is inspired by BPF, but it is not eBPF
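
Why a set beats a long chain of per-address rules can be illustrated with a toy comparison. A Python `set` stands in for a hash-based nftables set here; real sets may use other backends (for example for intervals), and all names are hypothetical.

```python
import ipaddress

# 10,000 blocked addresses: 10.0.0.0 through 10.0.39.15.
blocked = [ipaddress.ip_address("10.0.0.0") + i for i in range(10_000)]

def match_linear(rules, addr):
    """One rule per address, checked in order, like a long chain of rules."""
    comparisons = 0
    for rule_addr in rules:
        comparisons += 1
        if rule_addr == addr:
            return True, comparisons
    return False, comparisons

blocked_set = set(blocked)  # hash set: one lookup, whatever its size

def match_set(addr_set, addr):
    return addr in addr_set

probe = ipaddress.ip_address("10.0.39.15")  # the last address in the list
print(match_linear(blocked, probe))   # (True, 10000)
print(match_set(blocked_set, probe))  # True
```

The linear chain pays for every rule it walks past, while the set lookup cost stays flat as the list grows.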

Pitfalls

✗ Saying nftables is fully incompatible. No, there is the `iptables-nft` shim for backward compatibility.
✗ Confusing `nftables` (the frontend) and `nf_tables` (the kernel subsystem).
✗ Forgetting that `firewalld` (the RHEL/CentOS frontend) and `ufw` (the Ubuntu frontend) exist as layers on top of iptables and nftables.

Follow-up

? What is a netfilter hook? Name all five in the order a packet passes through them.
? Why are atomic updates in nftables better than iptables-restore?
? How do `sets` work in nftables, and why are they faster than a chain of rules?

Deep dive in the knowledge base

#nat-types

intermediate · often

What is the difference between SNAT, DNAT, and MASQUERADE?

What to answer

SNAT (Source NAT) rewrites the packet's source IP on the outbound interface and needs an explicit target IP. DNAT (Destination NAT) rewrites the destination IP or port and is used to forward ports to internal services. MASQUERADE is a special case of SNAT where the outbound IP is taken from the interface automatically (for dynamic IPs like a home DSL or 4G link).

What they want to hear

A senior should:
- tell apart SNAT and MASQUERADE: SNAT with a fixed IP is faster (no need to look up the interface IP each time) but requires a static IP
- explain that under the hood NAT runs through conntrack: the kernel keeps a table mapping each original connection tuple (addresses, ports, protocol) to its translated form and reverses the translation on the reply packets
- name `nf_conntrack_max` and the problem of exhausting it under heavy load (a classic SRE incident)
- give examples: SNAT on a gateway router for a home network reaching the internet; DNAT on a load balancer (for instance, kube-proxy in iptables mode does DNAT to the pod IP)
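
The role of the connection-tracking table can be shown with a toy source NAT. This is a sketch of the idea only: real conntrack also tracks protocol, state, and timeouts, and the class name, port numbering, and addresses are hypothetical.

```python
class ToySnat:
    """A toy source-NAT table keyed by connection tuple."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.forward = {}  # (src_ip, src_port, dst_ip, dst_port) -> public port
        self.reverse = {}  # (public port, dst_ip, dst_port) -> (src_ip, src_port)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the source of an outgoing packet and remember the mapping."""
        key = (src_ip, src_port, dst_ip, dst_port)
        if key not in self.forward:
            self.forward[key] = self.next_port
            self.reverse[(self.next_port, dst_ip, dst_port)] = (src_ip, src_port)
            self.next_port += 1
        return (self.public_ip, self.forward[key], dst_ip, dst_port)

    def reply(self, remote_ip, remote_port, public_port):
        """Map a reply arriving at public_ip:public_port back to the inside host."""
        return self.reverse.get((public_port, remote_ip, remote_port))

nat = ToySnat("203.0.113.10")
print(nat.outbound("192.168.1.20", 51000, "198.51.100.5", 443))
# ('203.0.113.10', 40000, '198.51.100.5', 443)
print(nat.reply("198.51.100.5", 443, 40000))  # ('192.168.1.20', 51000)
print(nat.reply("198.51.100.5", 443, 40001))  # None
```

A reply with no matching entry has nowhere to go, which is what happens to new connections once the real table is full.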

Pitfalls

✗ Getting the direction of SNAT and DNAT wrong. SNAT always changes the SOURCE, DNAT always changes the DESTINATION, no matter which way the packet goes.
✗ Not mentioning conntrack. Netfilter's NAT is built on it: reply packets can be translated back only because the connection is tracked.
✗ Thinking MASQUERADE is faster than SNAT. The opposite is true: MASQUERADE does an extra interface-IP lookup each time.

Follow-up

? What happens when `nf_conntrack_max` is exhausted?
? How does Kubernetes implement a Service through iptables DNAT?
? What is hairpin NAT, and why do you need it?

Deep dive in the knowledge base

NAT: Network Address Translation
[[conntrack]]

#routing-table-lookup

senior · sometimes

How does Linux choose the interface for an outbound packet?

What to answer

By the routing table: the kernel picks the route with the longest prefix (longest prefix match) that covers the destination IP. If nothing more specific matches, it falls back to the default route (`0.0.0.0/0`). Linux supports several tables through **policy routing**: a rule (`ip rule`) picks the table based on source IP, fwmark, or incoming interface.

What they want to hear

A senior should:
- explain longest prefix match and why a `/32` route beats a `/24`
- name the order: rules (`ip rule list`), then tables (`ip route show table N`); `ip route get <dst>` shows the final decision (the IPv4 route cache itself was removed in Linux 3.6)
- give a policy-routing example: two providers, where the route depends on the source IP (`ip rule add from 10.0.1.0/24 table isp_a`)
- mention `ip route get <dst> from <src>` as the main command for diagnosing why a packet went the wrong way
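
Longest prefix match itself fits in a few lines. The routing table below is hypothetical, and the linear scan is for illustration only; the kernel uses a trie and also weighs metrics.

```python
import ipaddress

# Hypothetical routing table: (prefix, outgoing interface).
ROUTES = [
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),   # default route
    (ipaddress.ip_network("10.0.0.0/24"), "eth1"),
    (ipaddress.ip_network("10.0.0.5/32"), "wg0"),
]

def lookup(dst, routes=ROUTES):
    """Return the interface of the most specific route covering dst, or None."""
    addr = ipaddress.ip_address(dst)
    candidates = [(net, dev) for net, dev in routes if addr in net]
    if not candidates:
        return None
    net, dev = max(candidates, key=lambda item: item[0].prefixlen)
    return dev

print(lookup("10.0.0.5"))   # wg0
print(lookup("10.0.0.77"))  # eth1
print(lookup("8.8.8.8"))    # eth0
```

All three routes cover `10.0.0.5`, and the `/32` wins because it is the most specific.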

Pitfalls

✗ Saying it just takes the 'next entry in the table'. No, only the longest prefix.
✗ Forgetting policy routing: a single-interface machine rarely needs it, but the moment you have two uplinks it is a must.
✗ Not knowing `ip route get`, the main debug tool.

Follow-up

? What does `ip route get 8.8.8.8 from 10.0.1.5` show?
? Why do the `local` and `main` tables exist in `ip rule`?
? What is a VRF, and how does it differ from policy routing?

Deep dive in the knowledge base

#tcpdump-vs-ss

junior · often

My service is not responding. What do I look at, tcpdump or ss?

What to answer

`ss -tlnp` first: check that the process really listens on the port and on the right address. If it does not, the problem is in the service or its config, and tcpdump is not needed. If the port is open but connect fails, run `tcpdump -i any port N` to see whether the SYN packets arrive and whether the server answers.

What they want to hear

A senior should:
- show the troubleshooting workflow, layer by layer from low to high (interface up, IP assigned, route present, firewall passes, process listening, process answering)
- name `ss -s` for a connection summary, `ss -tnp` for active TCP, `-l` for LISTEN
- mention that `ss` reads socket data over netlink (faster than netstat, which parses `/proc/net/tcp`)
- use tcpdump to confirm that the packets arrive at all, but not as the first-step diagnostic
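
The "process listening" step can also be checked from the client side. The sketch below is a hypothetical helper, not a replacement for `ss`: it separates a refused connection (nothing listens, or a firewall rejects) from a timeout (packets are dropped somewhere on the path).

```python
import socket

def probe(host, port, timeout=2.0):
    """Try a TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"   # an RST came back: no listener, or an explicit reject
    except socket.timeout:
        return "timeout"   # no answer at all: dropped packets or a dead host
    except OSError as exc:
        return f"error: {exc}"
```

A "refused" result points back at the service and `ss -tlnp`, while a "timeout" is the case where tcpdump on the server starts to pay off.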

Pitfalls

✗ Jumping straight into tcpdump without checking `ss`. The cause of the incident is usually already visible in `ss`.
✗ Using `netstat` instead of `ss`. netstat has been deprecated for 10+ years.
✗ Forgetting `-p` (show the process). Without it you cannot tell who holds the port.

Follow-up

? What does `ss -i` show (with details on the congestion window and RTT)?
? How does the tcpdump filter `tcp[tcpflags] & tcp-syn != 0` pick out packets with the SYN flag set, and why does it also match SYN-ACK?
? How does `tcpdump -i any` differ from `tcpdump -i eth0`?
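
The bit test behind the `tcp[tcpflags]` filter can be reproduced directly. The flag values are the standard TCP header bits; the helper names are hypothetical.

```python
# TCP flag bits as they appear in byte 13 of the TCP header.
FIN, SYN, RST, PSH, ACK = 0x01, 0x02, 0x04, 0x08, 0x10

def has_syn(flags):
    """Same test as the filter `tcp[tcpflags] & tcp-syn != 0`."""
    return (flags & SYN) != 0

def is_pure_syn(flags):
    """Same test as the filter `tcp[tcpflags] == tcp-syn`: SYN and nothing else."""
    return flags == SYN

print(has_syn(SYN))            # True
print(has_syn(SYN | ACK))      # True, a SYN-ACK also has the SYN bit set
print(has_syn(ACK))            # False
print(is_pure_syn(SYN | ACK))  # False
```

The masking form therefore shows both halves of the handshake, and the equality form narrows the capture to initial SYNs only.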

Deep dive in the knowledge base
