Why different filesystems exist
A filesystem decides how to lay out data in blocks on a block device, how to track each file's metadata in inodes, how to find files by name, and how to handle crash recovery, parallelism, and fragmentation. Different priorities lead to different filesystems.
ext4, the workhorse
- Default on most Linux distributions
- Journaling (protection against a crash during a write)
- Up to 1 EiB per filesystem, 16 TiB per file
- Well studied, stable for 15+ years
- Downsides: the inode count is fixed at creation; no native snapshots; limited parallelism under heavy load
sudo mkfs.ext4 /dev/sdb1
sudo mkfs.ext4 -L data -m 1 /dev/sdb1 # label "data", reserve 1% (instead of the default 5%)
sudo tune2fs -l /dev/sdb1 # parameters of an existing filesystem
sudo e2fsck -f /dev/sdb1 # check (only when unmounted!)
sudo resize2fs /dev/sdb1 # fit to the partition size (after parted)
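Because the inode count is fixed at creation, it can help to estimate it before running mkfs. The sketch below is a rough approximation only: it assumes a ratio of one inode per 16384 bytes, which is a common default but really comes from /etc/mke2fs.conf and can be overridden with `mkfs.ext4 -i`. The helper name `estimate_inodes` is hypothetical and not part of e2fsprogs.

```shell
# estimate_inodes SIZE_BYTES [BYTES_PER_INODE]
# Rough number of inodes mkfs.ext4 would allocate.
# 16384 bytes per inode is an ASSUMED default; check /etc/mke2fs.conf.
estimate_inodes() {
    size_bytes=$1
    bytes_per_inode=${2:-16384}
    echo $(( size_bytes / bytes_per_inode ))
}

estimate_inodes $(( 100 * 1024 * 1024 * 1024 ))        # 100 GiB, assumed default ratio: 6553600
estimate_inodes $(( 100 * 1024 * 1024 * 1024 )) 4096   # same size, tuned for many small files: 26214400
```

On an existing filesystem, `df -i` shows how many inodes are used and free.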
xfs, for large and parallel loads
- Default on RHEL/CentOS 7+, a common pick for databases and file servers
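- Scales well across many CPUs and very large files (the filesystem is split into allocation groups that can be used in parallel)
- Up to 8 EiB per filesystem and per file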
- Downsides: you cannot shrink the filesystem (grow only); harder to recover after corruption
sudo mkfs.xfs /dev/sdb1
sudo mkfs.xfs -f -L data /dev/sdb1 # -f: force overwrite
sudo xfs_info /mnt/data # parameters
sudo xfs_growfs /mnt/data # grow (grow only!)
btrfs, copy-on-write
- Snapshots (like ZFS), subvolumes, native RAID 0/1/10
- Data and metadata checksums for bit rot detection
- Downsides: RAID 5/6 has been unstable historically; complex; can fragment under high-write load; recovery is harder than on ext4
sudo mkfs.btrfs -L data /dev/sdb1
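sudo btrfs subvolume create /mnt/data/work # create the subvolume that will be snapshotted
sudo btrfs subvolume snapshot /mnt/data/work /mnt/data/snap-2024-01 # snapshot of "work"
sudo btrfs balance start /mnt/data # redistribute chunks across devices (a full balance can take a long time)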
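Snapshots are usually taken on a schedule, so a dated naming scheme is handy. The sketch below only prints the command instead of running it; `snapshot_cmd` is a hypothetical helper, the paths are the example paths from above, and `-r` (read-only snapshot) is an assumption about what a backup-style snapshot would want.

```shell
# snapshot_cmd SOURCE_SUBVOLUME DEST_DIR [DATE]
# Prints (does not run) a read-only snapshot command with a dated name.
snapshot_cmd() {
    src=$1
    dest_dir=$2
    stamp=${3:-$(date +%Y-%m-%d)}   # default: today's date
    echo "btrfs subvolume snapshot -r $src $dest_dir/snap-$stamp"
}

snapshot_cmd /mnt/data/work /mnt/data 2024-01-15
# prints: btrfs subvolume snapshot -r /mnt/data/work /mnt/data/snap-2024-01-15
```

Review the printed command first, then run it with sudo.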
zfs, the most advanced, not in the mainline kernel
- Checksums, COW, snapshots, send/receive, deduplication, native RAID-Z
- Bit rot protection at the level of an enterprise SAN
- Downsides: NOT in the mainline kernel (its CDDL license is incompatible with the kernel's GPL); installed separately (OpenZFS, formerly ZFS on Linux); hungry for RAM (the ARC cache); less widespread
- Main users: TrueNAS, Proxmox, many storage servers
Which one to choose
| Use case | Recommendation |
|---|---|
| Server root, ordinary files | ext4 |
| Databases, virtual disks, parallel I/O | xfs |
| Home NAS with snapshots | btrfs or zfs |
| Backup target with deduplication | zfs |
| Ephemeral container layer | overlay (on top of ext4/xfs) |
| Embedded / small filesystems | f2fs (NAND-aware) |
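Before choosing or migrating, it helps to know which filesystem a directory currently sits on. The sketch below does this by finding the longest matching mount point in a mount table. The helper name `fstype_of` and its optional second argument are hypothetical; the default table is /proc/self/mounts, and mount points containing spaces (escaped there as `\040`) are not handled.

```shell
# fstype_of DIRECTORY [MOUNTS_FILE]
# Prints the filesystem type of the mount that contains DIRECTORY.
# MOUNTS_FILE defaults to /proc/self/mounts (fields: device mountpoint fstype ...).
fstype_of() {
    # Resolve symlinks so the path can be compared with mount points.
    target=$(cd "$1" 2>/dev/null && pwd -P) || return 1
    awk -v t="$target" '
        {
            mp = $2
            # A mount point matches if it is "/", the path itself, or a parent directory.
            if (mp == "/" || t == mp || index(t, mp "/") == 1) {
                # Keep the longest match; on ties the later (topmost) mount wins.
                if (length(mp) >= best) { best = length(mp); fs = $3 }
            }
        }
        END { print fs }
    ' "${2:-/proc/self/mounts}"
}
```

For example, `fstype_of /var/lib` prints the type of whichever mount holds that directory. Where util-linux is installed, `findmnt -no FSTYPE -T /var/lib` gives the same information.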
Tmpfs / overlay / proc / sysfs, the pseudo-filesystems
Not every filesystem lives on disk:
- tmpfs lives in RAM (`/run`, `/dev/shm`, and on many distributions `/tmp`)
- proc is `/proc`, the interface to the kernel and to per-process information (one directory per PID)
- sysfs is `/sys`, the interface to the device tree and drivers
- devtmpfs is `/dev`, dynamic device nodes
- overlay holds the layers of Docker images (lower + upper = merged view)
- fuse is a userspace filesystem (sshfs, ceph-fuse, your own)
Journaling
When a write crashes mid-way, the journal helps you recover:
- ext4 journals metadata (default `data=ordered`); `data=journal` covers the data too (slower, safer)
- xfs keeps the log in a separate zone; you can move it to its own disk (`-l logdev=...`)
- btrfs/zfs use COW, so a traditional journal is not needed
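To see which ext4 data mode a mount uses, look at its option string. The sketch below extracts the `data=` value from a comma-separated option list and falls back to `ordered`, the default named above, when the option is absent. `ext4_data_mode` is a hypothetical helper, and falling back to `ordered` is an assumption that holds only for ext4 mounts.

```shell
# ext4_data_mode MOUNT_OPTIONS
# Prints the journaling data mode from an option string such as "rw,relatime,data=journal".
# Falls back to "ordered" (ASSUMED default) when no data= option is present.
ext4_data_mode() {
    mode=ordered
    old_ifs=$IFS
    IFS=,                      # split the option string on commas
    for opt in $1; do
        case $opt in
            data=*) mode=${opt#data=} ;;
        esac
    done
    IFS=$old_ifs
    echo "$mode"
}

ext4_data_mode "rw,relatime,data=journal"   # prints: journal
ext4_data_mode "rw,relatime"                # prints: ordered
```

The option string can come from the fourth field of an /etc/fstab or /proc/self/mounts line.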
Checking integrity
sudo umount /mnt/data # ALWAYS unmount before fsck
sudo fsck -f /dev/sdb1 # the ext family
sudo xfs_repair /dev/sdb1 # XFS (unmounted; needed only after corruption, since the log is replayed at mount; changes nothing on a healthy filesystem)
sudo btrfs check /dev/sdb1 # btrfs (read-only by default; --repair is a last resort)
sudo btrfs scrub start /mnt/data # btrfs: walk all data, verify checksums (runs on a mounted filesystem)
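Each filesystem has its own checker, so a small dispatcher can help avoid running the wrong tool. The sketch below only prints a non-destructive check command for a given type and device. `check_cmd` is a hypothetical helper, and using the read-only variants (`e2fsck -n`, `xfs_repair -n`, `btrfs check --readonly`) is an assumption about what a safe first step looks like.

```shell
# check_cmd FSTYPE DEVICE
# Prints (does not run) a read-only integrity check for the given filesystem type.
check_cmd() {
    fstype=$1
    device=$2
    case $fstype in
        ext2|ext3|ext4) echo "e2fsck -f -n $device" ;;            # -n: answer "no" to every fix
        xfs)            echo "xfs_repair -n $device" ;;           # -n: report only, modify nothing
        btrfs)          echo "btrfs check --readonly $device" ;;  # read-only is also the default
        *)              echo "no checker known for: $fstype" >&2; return 1 ;;
    esac
}

check_cmd ext4 /dev/sdb1    # prints: e2fsck -f -n /dev/sdb1
check_cmd xfs /dev/sdb1     # prints: xfs_repair -n /dev/sdb1
```

Run the printed command with sudo on an unmounted device. Drop the read-only flag only once the report shows that repairs are needed.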