File systems and inodes

Questions about how the file system is built and where even experienced engineers slip: inode versus path, hard link versus symlink, the df-versus-du discrepancy, fsync and what it costs, mount options and why they exist. These topics come up for backend and SRE roles and for anyone who deals with persistent storage.

6 questions · ~22 min read

#inode-vs-path

intermediate · often

What is an inode? How does a file name differ from the file itself?

What to answer

An inode is the kernel's metadata structure for a file: owner, permissions, size, timestamps, and pointers to the data blocks. A file name is a directory entry that links a name string to an inode number. The file is the inode plus its data; the name is only a label. One inode can have several names (a hard link); a file with an open fd stays alive until that fd is closed, even after `rm`.

What they want to hear

A candidate should:
- tell apart three things: the inode (holds metadata), the data blocks (hold the contents), and the directory entry (links a name to an inode)
- explain why `rm` does not free space right away when an fd is still open: the kernel decrements the link count, but while the in-memory reference count (`i_count`) is above zero the blocks are not freed until the last fd closes
- name `ls -i` and `stat` as the tools to see the inode number
- explain that on ext4 the number of inodes is fixed at `mkfs` time (XFS and Btrfs allocate them dynamically), which is why `df -i` matters on a file system with millions of small files
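The name-versus-inode distinction is easy to check by hand. Below is a minimal Python sketch, assuming a Linux-like system and a temporary directory that supports hard links; the file names `data.txt` and `alias.txt` are made up for illustration.

```python
import os
import tempfile


def inode_demo() -> dict:
    """Show that names are labels on an inode and that an open fd outlives rm."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "data.txt")      # hypothetical file names
        second = os.path.join(d, "alias.txt")

        with open(first, "w") as f:
            f.write("payload")

        os.link(first, second)                   # second directory entry, same inode
        st_first, st_second = os.stat(first), os.stat(second)

        fd = os.open(first, os.O_RDONLY)         # keep the inode referenced
        os.unlink(first)                         # what `rm` does: drop a name
        os.unlink(second)

        links_after_rm = os.fstat(fd).st_nlink   # names left on the inode
        still_readable = os.read(fd, 100)        # data is still there
        os.close(fd)                             # only now can the blocks be freed

        return {
            "same_inode": st_first.st_ino == st_second.st_ino,
            "links_with_two_names": st_first.st_nlink,
            "links_after_rm": links_after_rm,
            "still_readable": still_readable,
        }
```

Calling `inode_demo()` should report one shared inode number, a link count of 2 while both names exist, and readable contents after both names are gone. On common Linux file systems `links_after_rm` is 0, which is the state `lsof` labels as deleted.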

Pitfalls

✗ Saying 'a file equals its name'. No, the name is only a reference to an inode.
✗ Not knowing about `df -i`, a classic cause of 'there is space, but the file will not create'.
✗ Assuming `rm` frees space instantly. It does not, if someone holds an fd.

Follow-up

? What does `lsof | grep deleted` do, and when do you need it?
? Can you create a file when all inodes are used up but disk space is free?
? How does the ext4 inode layout differ from XFS?

Deeper in the knowledge base

#hard-vs-symlink

junior · often

What is the difference between a hard link and a symbolic link?

What to answer

A hard link is a second name for the same inode. It is indistinguishable from the original, both share one inode number, and removing one of the names leaves the data alone as long as at least one link remains. A symbolic link is a separate file with its own inode that holds a path to the target. A symlink can point at a path that does not exist and at files on another file system; a hard link works only within the same file system and not on a directory.

What they want to hear

A candidate should:
- name the limits of a hard link: only within one file system, and not to directories (apart from `.` and `..`, which the kernel maintains)
- explain why a hard link to a directory is forbidden: the directory graph would stop being a tree, and tools that walk it, such as `find /`, could loop forever
- say that `cp -a` copies a symlink as a symlink (it does not follow it), while `cp -L` follows it
- mention `readlink -f` for the canonical path with symlinks resolved
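The difference shows up in a few system calls. This is a small Python sketch, assuming a temporary directory on a file system that supports both link types; the names `target.txt`, `hard.txt`, and `soft.txt` are illustrative only.

```python
import os
import tempfile


def link_demo() -> dict:
    """Compare a hard link and a symlink before and after the target is removed."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target.txt")   # hypothetical file names
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")

        with open(target, "w") as f:
            f.write("hello")

        os.link(target, hard)       # another name for the same inode
        os.symlink(target, soft)    # a new inode that stores a path

        result = {
            "hard_shares_inode": os.stat(hard).st_ino == os.stat(target).st_ino,
            # lstat looks at the link itself, stat follows it
            "symlink_has_own_inode": os.lstat(soft).st_ino != os.stat(target).st_ino,
            "symlink_resolves_to_target":
                os.path.realpath(soft) == os.path.realpath(target),
        }

        os.unlink(target)           # remove the "original" name
        with open(hard) as f:
            result["hard_after_rm"] = f.read()   # data is still reachable
        # The symlink still exists but now points at nothing
        result["symlink_dangling"] = os.path.islink(soft) and not os.path.exists(soft)
        return result
```

`os.lstat` versus `os.stat` here mirrors `stat symlink` versus `stat -L symlink` on the command line.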

Pitfalls

✗ Getting `rm` on the original wrong: with a hard link the original has no special status, while for a symlink deleting the target leaves the link dangling.
✗ Claiming a symlink is faster than a hard link. A hard link is an ordinary directory entry; a symlink adds one more path resolution, so if anything it is marginally slower.
✗ Not knowing that a symlink can point to another file system while a hard link cannot.

Follow-up

? Why does `mv` across a file system boundary physically copy the file, while a move within the same file system does not?
? What does `stat symlink` print versus `stat -L symlink`?
? How does a bind mount work, and how is it different from a symlink to a directory?

Deeper in the knowledge base

#df-vs-du

intermediate · often

Why do `df` and `du` show different numbers? What do you do about it?

What to answer

`df` asks the file system itself (the `statfs` counters kept in the superblock) how many blocks are allocated, which includes files that were deleted but are still held open. `du` walks the directory tree and sums the blocks allocated to the files it can reach. Holes in sparse files are not allocated, so neither tool counts them by default. If a process holds an fd on a deleted gigabyte log, `df` shows the space as used while `du` does not see it. The fix: find the process with `lsof | grep deleted` and make it release the file, gracefully with SIGHUP if the daemon reopens its logs on that signal, or the hard way with a restart.
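What `lsof | grep deleted` looks for can be approximated on Linux by reading the symlinks under `/proc/<pid>/fd`, where the kernel appends ` (deleted)` to the target of an unlinked file. The sketch below assumes Linux with `/proc` mounted; the function name `deleted_open_files` is made up, and it is not a replacement for `lsof`.

```python
import os


def deleted_open_files(pid: str = "self") -> list[tuple[str, str]]:
    """Return (fd, target) pairs for descriptors whose file was unlinked."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    try:
        fds = os.listdir(fd_dir)
    except OSError:
        return found  # no /proc (not Linux), no such pid, or no permission
    for fd in fds:
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            continue  # the fd was closed between listdir and readlink
        if target.endswith(" (deleted)"):
            found.append((fd, target))
    return found
```

To scan another process, pass its pid as a string, for example `deleted_open_files("1234")`; reading other users' processes generally requires root.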

What they want to hear

A candidate should name the typical reasons for the discrepancy:
- open deleted files (the most common; `lsof` finds them)
- sparse files (plain `du` counts only allocated blocks, so holes are excluded; `du --apparent-size` reports the logical size, holes included, which can be far larger than the real usage)
- reserved blocks (`tune2fs -m`, 5 percent for root by default on ext4), which `df` leaves out of the available space and `du` does not see
- a mount overlay (files hidden under a mount point, or a container that sees only its own layer while the host sees everything)
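Two of these causes can be made visible with `stat` and `statvfs` fields. The Python sketch below assumes a file system that supports sparse files (most Linux ones do); `sparse_sizes` and `df_like` are illustrative helper names, not standard tools.

```python
import os
import tempfile


def sparse_sizes(logical_bytes: int = 64 * 1024 * 1024) -> tuple[int, int]:
    """Create a file that is one big hole; return (apparent, allocated) bytes."""
    with tempfile.NamedTemporaryFile() as f:
        f.truncate(logical_bytes)        # extend the file without writing data
        st = os.fstat(f.fileno())
        apparent = st.st_size            # what `ls -l` and `du --apparent-size` report
        allocated = st.st_blocks * 512   # what plain `du` reports (512-byte units)
        return apparent, allocated


def df_like(path: str = "/") -> dict:
    """Roughly the numbers `df` works from, in bytes."""
    v = os.statvfs(path)
    return {
        "total": v.f_blocks * v.f_frsize,
        "free": v.f_bfree * v.f_frsize,        # free blocks, including reserved ones
        "available": v.f_bavail * v.f_frsize,  # free blocks an unprivileged user may use
    }
```

On a file system with sparse-file support, `allocated` comes out far below `apparent`. In `df_like`, the gap between `free` and `available` is the reserved-block share.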

Pitfalls

✗ Assuming it is always a bug. Usually it is expected behavior, an open fd.
✗ Not knowing `lsof | grep deleted`, the main diagnostic tool.
✗ Mixing up `du` and `du --apparent-size` on sparse files.

Follow-up

? How exactly does SIGHUP make syslog reopen its log file?
? What happens to an open fd when you run `truncate -s 0 file.log`?
? Why can `df` on an overlay file system in Docker be off by a factor of two?

Deeper in the knowledge base

#fsync-vs-write

senior · sometimes

What is the difference between write() and fsync()? When is the data on disk?

What to answer

`write()` copies the data into the kernel page cache and returns; the data is still in RAM. `fsync(fd)` blocks the process until the kernel flushes that file's dirty pages to the physical medium and gets an acknowledgment from the disk. Before `fsync`, a kernel crash or power loss loses everything that was sitting in the page cache. A database, and any system with durability guarantees, calls fsync on every commit.

What they want to hear

A senior should:
- tell apart `fsync` (data plus all metadata), `fdatasync` (data plus only the metadata needed to read it back, such as the file size, so it is usually cheaper when overwriting an existing file), and `sync` (everything on the system)
- note that the `O_DSYNC` and `O_SYNC` open flags make every write synchronous, which is slow but needs no explicit fsync
- explain the role of the write barrier and why a RAID controller with a battery-backed cache can safely acknowledge fsync before the data reaches the disks (the cache contents survive a power loss)
- mention that an NVMe SSD with power-loss protection can acknowledge an fsync in tens of microseconds, while an ordinary consumer SSD often takes milliseconds
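The write-then-fsync sequence can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular database does it; `durable_write` is a made-up helper, and the directory fsync assumes a Linux-like system where a directory can be opened read-only and synced.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data and return only after the kernel reports it is on stable storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # lands in the page cache only
            view = view[written:]          # handle short writes
        os.fsync(fd)                       # block until the device acknowledges
    finally:
        os.close(fd)

    # A newly created name lives in the parent directory, which needs its own fsync.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Swapping `os.fsync(fd)` for `os.fdatasync(fd)` (available on Linux) skips metadata that is not needed to read the data back. Whether either call is a real barrier still depends on the device and any virtualization layer underneath.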

Pitfalls

✗ Assuming `write()` goes straight to disk. No, only to the page cache.
✗ Forgetting metadata: `fsync` flushes all of it, while `fdatasync` skips whatever is not needed to read the data back (timestamps, for example).
✗ Not knowing about the write barrier and why on a virtual disk fsync can be a no-op (if the virtualization layer ignores it).

Follow-up

? Why does PostgreSQL fsync the WAL rather than the data files?
? What is a write barrier, and why does the file system journal need it?
? How does `fsync` differ from `msync` for memory-mapped files?

Deeper in the knowledge base

Page cache: disk in memory
Virtual memory: virtual addresses, page tables
[[io-uring]]

#mount-options

intermediate · sometimes

Why do you need the `noexec`, `nosuid`, and `nodev` options when mounting?

What to answer

This is security hardening for directories that an unprivileged user can write to. `noexec` forbids running binaries from that file system (even if the file has +x). `nosuid` ignores the SUID bit, so a binary does not gain the owner's privileges. `nodev` forbids interpreting device files. It is the usual set for `/tmp`, `/dev/shm`, `/var/tmp`, and `/home` on shared servers.

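What they want to hear

A candidate should:
- explain the threat: an attacker who manages to get a SUID-root binary onto a user-writable file system such as `/tmp` runs it and gets root; `nosuid` makes the kernel ignore the bit
- note that `noexec` does not stop interpreters: `bash script.sh` works because the executed binary is bash and the script is only read as data. The older trick of invoking the dynamic loader directly (`/lib/ld-linux.so.2 ./binary`) is blocked on current kernels, which refuse to map files from a `noexec` mount as executable. Either way, it is not absolute protection
- mention the CIS benchmark, where these options are a required item for production servers
- say that these are per-mount flags: a bind mount starts out with the flags of the mount it was created from, and changing them takes a separate `mount -o remount,bind,noexec` on the new mount point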
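Whether a given directory actually sits on a hardened mount can be checked from `statvfs` flags. The sketch below assumes Linux, where Python exposes `os.ST_NODEV` and `os.ST_NOEXEC`; `mount_flags` is an illustrative helper name.

```python
import os


def mount_flags(path: str) -> dict[str, bool]:
    """Report the hardening-related flags of the mount that contains `path`."""
    flags = os.statvfs(path).f_flag
    return {
        "nosuid": bool(flags & os.ST_NOSUID),
        "nodev": bool(flags & os.ST_NODEV),     # Linux-only constant
        "noexec": bool(flags & os.ST_NOEXEC),   # Linux-only constant
        "ro": bool(flags & os.ST_RDONLY),
    }
```

For example, `mount_flags("/tmp")` on a server hardened as described would be expected to show `nosuid`, `nodev`, and `noexec` all set. The same information is visible in `/proc/mounts` or via `findmnt`.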

Pitfalls

✗ Assuming `noexec` fully forbids execution. `bash file` gets around it.
✗ Applying `nodev` to `/dev`, where the device files live. Everything breaks.
✗ Not knowing that `mount -o remount,noexec /tmp` applies it on the fly.

Follow-up

? Does `mount --bind` carry over the source mount options or not?
? Why should `/proc` be `nosuid,nodev,noexec` by default?
? What happens if you switch `/tmp` to `nosuid` while processes that opened a temp file there are still running?

Deeper in the knowledge base

mount and /etc/fstab: attaching filesystems
[[setuid-setgid-sticky]]
CIS Benchmark and system hardening (lynis, OpenSCAP)

#fhs-where-things-live

junior · sometimes

Where in Linux do configs, logs, and caches live, and why there?

What to answer

The directories are standardized by the FHS (Filesystem Hierarchy Standard). `/etc` holds system and service configs, `/var/log` holds logs, `/var/cache` holds caches that can be regenerated (safe to delete), `/var/lib` holds service state (do not delete it), `/usr/local` and `/opt` hold software that did not come from packages, `/tmp` holds temporary files that are not guaranteed to survive a reboot (often tmpfs), and `/run` holds runtime state (tmpfs on systemd-based distros, cleared at every boot). The scheme historically keeps read-only data apart from mutable data so that `/usr` can be mounted read-only.

What they want to hear

A candidate should:
- tell apart `/var/log` (application logs) and `/run` (PID files and sockets, none of which survives a reboot)
- say why `/var/cache` can be cleared painlessly while `/var/lib` cannot (it holds service databases, sessions, and queues)
- name the XDG Base Directory Spec for user-level configs (`~/.config`, `~/.local/share`, `~/.cache`)
- explain why `/usr/local` is for things built from source and `/opt` is for self-contained third-party software (Chrome, Atlassian)
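For the user-level side, the XDG lookup rule is short enough to write down. This Python sketch follows the spec's defaults for three of its variables; `xdg_dirs` and its `env`/`home` parameters are made up for illustration so the function can be exercised without touching the real environment.

```python
import os
from pathlib import Path


def xdg_dirs(env=None, home=None) -> dict[str, Path]:
    """Resolve XDG base directories, falling back to the spec's defaults."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    defaults = {
        "XDG_CONFIG_HOME": home / ".config",
        "XDG_DATA_HOME": home / ".local" / "share",
        "XDG_CACHE_HOME": home / ".cache",
    }
    resolved = {}
    for var, default in defaults.items():
        value = env.get(var, "")
        # Unset, empty, or relative values are ignored in favor of the default.
        resolved[var] = Path(value) if os.path.isabs(value) else default
    return resolved
```

A service that keeps regenerable data under the resolved `XDG_CACHE_HOME` and durable data under `XDG_DATA_HOME` follows the same cache-versus-state split as `/var/cache` and `/var/lib`.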

Pitfalls

✗ Putting state in `/var/cache`. After `apt clean` or a CI cleanup the service loses its data.
✗ Not knowing about `/run`: on modern distros `/var/run` is a symlink to `/run`, and scripts often still write the old path.
✗ Confusing `/usr/local` and `/opt`.

Follow-up

? Why did `/usr/bin` merge with `/bin` on modern Debian and RHEL? (UsrMerge)
? What lives in `/sys`, and how is it different from `/proc`?
? Why does `/var/spool` exist, and which services use it?

Deeper in the knowledge base

Filesystem Hierarchy Standard (FHS)
[[tmpfs-overlayfs]]
