Filesystems: ext4, xfs, btrfs, zfs

Why different filesystems exist

A filesystem decides how to lay out data on a block device, how to describe each file with an inode, how to find files by name, and how to handle crash recovery, parallelism, and fragmentation. Different priorities lead to different filesystems.
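
To make the split between names and inodes concrete, here is a small sketch. It assumes GNU coreutils `stat` (the `-c` and `-f` options); the helper name `same_inode` and the scratch directory are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: two names, one inode. Assumes GNU coreutils stat.

# same_inode A B: succeed if both paths refer to the same inode on the same device.
same_inode() {
    [ "$(stat -c '%d:%i' "$1")" = "$(stat -c '%d:%i' "$2")" ]
}

scratch=$(mktemp -d)             # throwaway directory for the demo
echo "hello" > "$scratch/a"
ln "$scratch/a" "$scratch/b"     # hard link: a second name for the same inode
cp "$scratch/a" "$scratch/c"     # copy: a new inode with its own data

ls -li "$scratch"                # the first column is the inode number
stat -f -c 'filesystem type: %T' "$scratch"

if same_inode "$scratch/a" "$scratch/b"; then
    echo "a and b share an inode"
fi
rm -rf "$scratch"
```

The directory only maps names to inode numbers; the inode itself carries the metadata and points at the data, which is why `a` and `b` are indistinguishable while `c` is a separate file.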

ext4, the workhorse

Default on most Linux distributions
Journaling (protection against a crash during a write)
Up to 1 EiB per filesystem, 16 TiB per file
Well studied, stable for 15+ years
Downsides: the inode count is fixed at creation; no native snapshots; limited parallelism under heavy load

bash

sudo mkfs.ext4 /dev/sdb1

sudo mkfs.ext4 -L data -m 1 /dev/sdb1   # label "data", reserve 1% (instead of the default 5%)

sudo tune2fs -l /dev/sdb1                # parameters of an existing filesystem

sudo e2fsck -f /dev/sdb1                 # check (only when unmounted!)

sudo resize2fs /dev/sdb1                 # fit to the partition size (after parted)
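
Because the inode count is fixed at creation, an ext4 filesystem can run out of inodes while free space remains. The sketch below watches for that; it assumes a `df` that accepts `-P` and `-i` (GNU coreutils does), and the function name `inodes_over` and the threshold of 80 are illustrative choices.

```bash
#!/usr/bin/env bash
# inodes_over THRESHOLD: read `df -Pi`-style output on stdin and print
# "mountpoint percent" for every filesystem whose inode use is at or above THRESHOLD.
# Mount points containing spaces are not handled in this sketch.
inodes_over() {
    awk -v limit="$1" 'NR > 1 && $5 ~ /^[0-9]+%$/ {
        pct = $5; sub(/%/, "", pct)
        if (pct + 0 >= limit + 0) print $6, pct
    }'
}

df -Pi | inodes_over 80

# If many small files are expected, more inodes can be requested at creation time,
# for example one inode per 8 KiB of space (the ratio is only an example):
#   sudo mkfs.ext4 -i 8192 /dev/sdb1
```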

xfs, for large and parallel loads

Default on RHEL/CentOS 7+, a common pick for databases and file servers
Scales well with many CPUs and large files: allocation groups let several threads allocate space in parallel
Up to 8 EiB
Downsides: you cannot shrink the filesystem (grow only); harder to recover after corruption

bash

sudo mkfs.xfs /dev/sdb1

sudo mkfs.xfs -f -L data /dev/sdb1       # -f: force overwrite

sudo xfs_info /mnt/data                   # parameters

sudo xfs_growfs /mnt/data                 # grow (grow only!)

btrfs, copy-on-write

Snapshots (like ZFS), subvolumes, native RAID 0/1/10
Data and metadata checksums for bit rot detection
Downsides: RAID 5/6 has been unstable historically; complex; can fragment under high-write load; recovery is harder than on ext4

bash

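sudo mkfs.btrfs -L data /dev/sdb1         # the commands below assume it is then mounted at /mnt/data

sudo btrfs subvolume create /mnt/data/snap-base   # create a new, empty subvolume

sudo btrfs subvolume snapshot /mnt/data/work /mnt/data/snap-2024-01   # snapshot the existing subvolume "work"

sudo btrfs balance start /mnt/data        # rebalance block groups across devices (a full balance can take a long time)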
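
Snapshots are most useful when they are taken regularly and named predictably. The following is a minimal sketch of dated, read-only snapshots. The helper names `run`, `snapshot_path`, and `take_snapshot`, the `DRY_RUN` variable, and the `snap-<name>-<date>` naming scheme are assumptions for illustration; only `btrfs subvolume snapshot -r` and `btrfs subvolume list` are real commands.

```bash
#!/usr/bin/env bash
# Sketch: dated read-only btrfs snapshots. Prints commands by default;
# set DRY_RUN=0 to execute them (needs root and a mounted btrfs filesystem).
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# snapshot_path SUBVOLUME: build a sibling path such as /mnt/data/snap-work-2024-01-31.
snapshot_path() {
    printf '%s/snap-%s-%s\n' "$(dirname "$1")" "$(basename "$1")" "$(date +%Y-%m-%d)"
}

# take_snapshot SUBVOLUME: -r makes the snapshot read-only.
take_snapshot() {
    run sudo btrfs subvolume snapshot -r "$1" "$(snapshot_path "$1")"
}

take_snapshot /mnt/data/work
run sudo btrfs subvolume list /mnt/data
```

A read-only snapshot cannot be modified by accident, which makes it a safer base for backups than a writable one.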

zfs, the most advanced, not in the mainline kernel

Checksums, COW, snapshots, send/receive, deduplication, native RAID-Z
Bit rot protection at the level of an enterprise SAN
Downsides: not in the mainline kernel (the CDDL and the GPL are incompatible), so it is installed separately (OpenZFS, formerly ZFS on Linux); hungry for RAM (the ARC cache); less widespread
Main users: TrueNAS, Proxmox, many storage servers
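
To see how much RAM the ARC is actually holding, the kernel module exposes counters in `/proc/spl/kstat/zfs/arcstats`. A rough sketch, assuming OpenZFS on Linux with the module loaded; the function name `arc_mib` is made up for this example.

```bash
#!/usr/bin/env bash
# arc_mib FIELD: read arcstats-format text ("name type data" per line) on stdin
# and print the value of FIELD converted from bytes to MiB.
arc_mib() {
    awk -v field="$1" '$1 == field { printf "%d\n", $3 / 1048576 }'
}

stats=/proc/spl/kstat/zfs/arcstats   # present only when the ZFS module is loaded
if [ -r "$stats" ]; then
    echo "ARC size:  $(arc_mib size  < "$stats") MiB"
    echo "ARC limit: $(arc_mib c_max < "$stats") MiB"
else
    echo "ZFS module not loaded; nothing to report"
fi
```

If the ARC competes with applications for memory, its upper bound can be lowered through the `zfs_arc_max` module parameter, which takes a value in bytes.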

Which one to choose

Use case	Recommendation
Server root, ordinary files	`ext4`
Databases, virtual disks, parallel I/O	`xfs`
Home NAS with snapshots	`btrfs` or `zfs`
Backup target with deduplication	`zfs`
Ephemeral container layer	`overlay` (on top of ext4/xfs)
Embedded / small filesystems	`f2fs` (NAND-aware)

Tmpfs / overlay / proc / sysfs, the pseudo-filesystems

Not every filesystem lives on disk:

tmpfs lives in RAM and swap (/run, /dev/shm, and on many distributions /tmp)
proc is /proc, the interface to kernel state and per-process information (one directory per PID)
sysfs is /sys, the interface to the kernel's device model and drivers
devtmpfs is /dev, dynamic device nodes
overlay stacks directories into one merged view (lower + upper = merged); Docker uses it for image layers
fuse is the framework for userspace filesystems (sshfs, ceph-fuse, your own)
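
The list above can be checked against a running system. This sketch filters `/proc/mounts` (fields: device, mount point, type, options) for those types; the function name `pseudo_mounts` is illustrative.

```bash
#!/usr/bin/env bash
# pseudo_mounts: read /proc/mounts-format text on stdin and print "type mountpoint"
# for the pseudo-filesystem types named in the list above.
# FUSE mounts usually show up with a subtype, for example fuse.sshfs.
pseudo_mounts() {
    awk '$3 ~ /^(tmpfs|proc|sysfs|devtmpfs|overlay|fuse(\..+)?)$/ { print $3, $2 }'
}

if [ -r /proc/mounts ]; then
    pseudo_mounts < /proc/mounts | sort
fi
```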

Journaling

If the system crashes in the middle of a write, the journal lets the filesystem replay or discard the unfinished operation on the next mount:

ext4 journals metadata (default data=ordered); data=journal covers the data too (slower, safer)
xfs journals metadata in a log that lives inside the filesystem by default; it can be placed on a separate device at mkfs time (-l logdev=...)
btrfs/zfs use COW, so a traditional journal is not needed
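
To see which ext4 journaling mode is in effect, look for a `data=` entry in the mount options. A minimal sketch; the function name `ext4_data_mode` is made up, and the fallback to `ordered` relies on that being the default, as stated above.

```bash
#!/usr/bin/env bash
# ext4_data_mode OPTIONS: print the data= mode found in a comma-separated mount
# option string, falling back to "ordered" (the ext4 default) when none is set.
ext4_data_mode() {
    data_mode=$(printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^data=//p' | head -n 1)
    echo "${data_mode:-ordered}"
}

# Report the mode for every mounted ext4 filesystem.
if [ -r /proc/mounts ]; then
    while read -r dev mnt type opts rest; do
        if [ "$type" = "ext4" ]; then
            echo "$mnt: data=$(ext4_data_mode "$opts")"
        fi
    done < /proc/mounts
fi

# Full data journaling is requested per mount, for example in /etc/fstab:
#   /dev/sdb1  /mnt/data  ext4  defaults,data=journal  0  2
```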

Checking integrity

bash

sudo umount /mnt/data           # ALWAYS unmount before fsck

sudo fsck -f /dev/sdb1           # the ext family

sudo xfs_repair /dev/sdb1        # XFS: check and repair, unmounted only (add -n to check without modifying)

sudo btrfs check /dev/sdb1       # btrfs: read-only by default, reports problems without fixing them

sudo btrfs scrub start /mnt/data # btrfs: run on a mounted filesystem; reads all data and verifies checksums
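
Since each family has its own checker, a small dispatcher keeps the choice in one place. This sketch only prints a read-only check command for a given type; the function name `check_command` is hypothetical, while `e2fsck -fn`, `xfs_repair -n`, and `btrfs check --readonly` are the real non-modifying invocations.

```bash
#!/usr/bin/env bash
# check_command FSTYPE DEVICE: print the read-only check command for a filesystem type.
# Nothing is executed; the caller decides whether to run the printed command.
check_command() {
    case "$1" in
        ext2|ext3|ext4) echo "e2fsck -fn $2" ;;
        xfs)            echo "xfs_repair -n $2" ;;
        btrfs)          echo "btrfs check --readonly $2" ;;
        *)              echo "no known checker for '$1'" >&2; return 1 ;;
    esac
}

check_command ext4 /dev/sdb1
check_command xfs /dev/sdb1

# The type of a real device can be read with blkid (needs root on most systems):
#   fstype=$(sudo blkid -o value -s TYPE /dev/sdb1)
#   check_command "$fstype" /dev/sdb1
```

Starting with the non-modifying form shows the extent of the damage before any repair is attempted.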

