Why ext4 specifically
ext4 is the default on Debian, Ubuntu, Linux Mint, Arch-based distributions, and many others. It is well understood, the tooling matured over 15 years, and it has the largest body of recovery cases. Compared to its predecessors:
| Version | What it added |
|---|---|
| ext2 (1993) | the base filesystem |
| ext3 (2001) | journaling |
| ext4 (2008) | extents, 1 EiB filesystem, 16 TiB files, online defrag, multi-block alloc |
If the goal is "put any filesystem on it and forget about it", use ext4. If you have millions of small files or a single file in the terabyte range, look at xfs or btrfs.
Journaling
The main difference from ext2 is the journal. When metadata changes (inodes, directories, bitmaps), the updated metadata blocks are written to a circular journal first, and only then to their final locations. After a crash before the journal commit, the incomplete transaction is discarded; after a crash following the commit, the journal is replayed and the changes are applied.
Three modes, selected with mount -o data=...:
| Mode | What is journaled | Trade-off |
|---|---|---|
| data=writeback | metadata only; data can reach the disk before or after the metadata | maximum speed, with the risk that "a file points at someone else's blocks" (stale data) after a crash |
| data=ordered (default) | metadata only; data blocks are flushed before the metadata that references them is committed | a compromise: metadata stays consistent with the data |
| data=journal | both metadata and data go through the journal | maximum safety; data is written twice, so writes are roughly 2x slower |
For a database it is sometimes worth using data=writeback, where the application WAL
takes on crash-safety. For a container host, use the default.
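To see which mode a mounted filesystem is using, the options column of /proc/mounts can be inspected. A minimal sketch, assuming the kernel omits data= from the option string when the default (ordered) is in effect; data_mode is a hypothetical helper name, not part of any tool:

```shell
# data_mode OPTIONS: print the journaling mode found in a comma-separated
# mount option string, falling back to the ext4 default "ordered".
# Hypothetical helper for illustration only.
data_mode() {
    mode=$(printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^data=//p' | tail -n 1)
    echo "${mode:-ordered}"
}

data_mode "rw,noatime,data=writeback"   # prints: writeback
data_mode "rw,relatime"                 # prints: ordered

# Against a live Linux system (not run here):
#   awk '$3 == "ext4" { print $2, $4 }' /proc/mounts |
#       while read -r mnt opts; do echo "$mnt $(data_mode "$opts")"; done
```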
The journal lives on the same filesystem in a special inode (8). You can move it to a separate fast disk:
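# Format the fast device as a dedicated journal device
mke2fs -O journal_dev /dev/nvme0n1
# Create the filesystem and attach the external journal (block sizes must match)
mke2fs -t ext4 -J device=/dev/nvme0n1 /dev/sda1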
Inode density is a fixed characteristic
At mkfs.ext4 time, the inode count is set to filesystem_size / bytes-per-inode.
The default is 1 inode per 16 KiB. On a 1 TiB filesystem that is about 67M inodes by default.
What matters: you cannot add inodes after mkfs. df -i shows the
usage. If you hit 100% inodes while gigabytes are still free, the only
options are to delete files or to recreate the filesystem.
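Since inode exhaustion cannot be fixed in place, it is worth catching early. A small sketch that filters df -Pi output for mount points above a threshold; inodes_over is a hypothetical name, and the parser assumes mount points without spaces:

```shell
# inodes_over LIMIT: read `df -Pi` style output on stdin and print every
# mount point whose inode usage (IUse%) is at or above LIMIT percent.
# Rows without a numeric percentage (shown as "-") are skipped.
inodes_over() {
    awk -v limit="$1" '
        NR > 1 && $5 ~ /^[0-9]+%$/ {
            sub(/%/, "", $5)
            if ($5 + 0 >= limit + 0) print $6, $5 "%"
        }'
}

# Typical use (not run here):
#   df -Pi | inodes_over 90
```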
For systems with millions of small files (a mail spool, a cache), raise the density:
# 1 inode per 4 KiB - four times as many
mkfs.ext4 -i 4096 /dev/sdb1
# Or via a profile from /etc/mke2fs.conf
mkfs.ext4 -T news /dev/sdb1
For huge files (video, backups), lower it (-i 65536). You save space
and speed up fsck.
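The effect of -i is easy to estimate with plain arithmetic. The sketch below uses the size / bytes-per-inode formula from above and additionally assumes a 256-byte inode (the usual mkfs.ext4 default) to show how much space the inode tables take. Real mkfs.ext4 results can differ slightly because of rounding per block group.

```shell
# inode_count SIZE_BYTES BYTES_PER_INODE: estimated number of inodes.
inode_count() {
    echo $(( $1 / $2 ))
}

size=$(( 1 << 40 ))   # 1 TiB
inode_size=256        # assumption: default inode size in bytes

for ratio in 4096 16384 65536; do
    count=$(inode_count "$size" "$ratio")
    table_gib=$(( count * inode_size / 1024 / 1024 / 1024 ))
    printf 'bytes-per-inode %s: %s inodes, about %s GiB of inode tables\n' \
        "$ratio" "$count" "$table_gib"
done
# bytes-per-inode 4096: 268435456 inodes, about 64 GiB of inode tables
# bytes-per-inode 16384: 67108864 inodes, about 16 GiB of inode tables
# bytes-per-inode 65536: 16777216 inodes, about 4 GiB of inode tables
```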
Block size
The default is 4 KiB on x86. Sizes of 1, 2, and 4 KiB are supported. Do not change it without a reason:
- 4K matches the kernel page size, which makes it the optimum
- <4K costs more overhead
- Larger than 4K is not supported on x86, where the page size is 4 KiB (on ARM/POWER you can use 16K, 64K)
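The rule above can be written as a quick check before calling mkfs.ext4 -b. This is only a sketch of the constraint as described here (power of two, at least 1 KiB, not larger than the page size); block_size_ok is a hypothetical helper:

```shell
# block_size_ok BYTES: succeed if BYTES is a power of two, at least 1024,
# and no larger than the page size reported by getconf.
block_size_ok() {
    bs=$1
    page=$(getconf PAGESIZE)
    [ "$bs" -ge 1024 ] && [ "$bs" -le "$page" ] && [ $(( bs & (bs - 1) )) -eq 0 ]
}

if block_size_ok 4096; then
    echo "4096 is usable on this machine (page size $(getconf PAGESIZE))"
fi
```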
noatime, relatime, lazytime
Under POSIX, every read has to update the file's atime in its inode. That is
a write on every read, which is lethal for performance.
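| Option | What it does |
|---|---|
| strictatime | atime on every read (the strict POSIX behavior) |
| relatime (kernel default) | atime updates only if the previous value is not newer than mtime/ctime, or is older than a day |
| noatime | never touch atime |
| lazytime | timestamp updates stay in memory and reach the disk when the inode is written for another reason, on sync/fsync, or after a day at the latest |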
For production, use noatime or lazytime. If you set nothing, you get
relatime, which has been the kernel default since Linux 2.6.30.
UUID=... / ext4 defaults,noatime,lazytime,errors=remount-ro 0 1
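To find entries that still rely on the default, fstab can be scanned for ext4 lines without noatime. A rough sketch; ext4_without_noatime is a hypothetical name, and it only looks at the options field of fstab-format input:

```shell
# ext4_without_noatime: read fstab-format lines on stdin and print the
# mount point of every ext4 entry whose options do not include noatime.
ext4_without_noatime() {
    awk '$1 !~ /^#/ && $3 == "ext4" && ("," $4 ",") !~ /,noatime,/ { print $2 }'
}

# Typical use (not run here):
#   ext4_without_noatime < /etc/fstab
```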
Useful mkfs/tune2fs options
# Creation
mkfs.ext4 -L data -m 1 -E lazy_itable_init=1,lazy_journal_init=1 /dev/sdb1
- -L LABEL sets the label
- -m N sets the reserve for root (default 5%, which is 500 GB on a 10 TB disk!)
- -E lazy_itable_init=1 does not zero the inode table at creation (much faster on large disks; a background process zeroes it later); lazy_journal_init=1 does the same for the journal
- -O ^has_journal means no journal (only if you know why, for example: the partition is temporary)
- -T usage_type picks a profile from /etc/mke2fs.conf, such as news, largefile, largefile4
Tuning an existing filesystem:
tune2fs -l /dev/sda1 # filesystem parameters
tune2fs -m 1 /dev/sda1 # lower reserved to 1%
tune2fs -L data /dev/sda1 # change the label
tune2fs -O ^has_journal /dev/sda1 # disable the journal (dangerous; unmount first)
tune2fs -c 0 -i 0 /dev/sda1 # disable mount-count and time-based fsck
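Before changing the reserve, it helps to know what it is now. The sketch below derives the percentage from the Block count and Reserved block count fields printed by dumpe2fs -h (or tune2fs -l); reserved_pct is a hypothetical helper:

```shell
# reserved_pct: read `dumpe2fs -h` style output on stdin and print the
# share of blocks reserved for root, as a percentage with one decimal.
reserved_pct() {
    awk -F: '
        /^Block count:/          { total = $2 }
        /^Reserved block count:/ { reserved = $2 }
        END { if (total > 0) printf "%.1f\n", 100 * reserved / total }'
}

# Typical use (needs root, not run here):
#   dumpe2fs -h /dev/sda1 2>/dev/null | reserved_pct
```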
Online resize and shrink
ext4 can grow while mounted; shrinking requires an unmounted filesystem:
# Grow (online or offline)
resize2fs /dev/sda1 # to the full partition size
resize2fs /dev/sda1 100G # to 100 GiB
# Shrink (offline only)
umount /dev/sda1
e2fsck -f /dev/sda1 # a mandatory check
resize2fs /dev/sda1 50G
This is a plus for ext4 over xfs, which can only grow.
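Because the shrink steps must run in order and stop at the first failure, they are a natural fit for a small wrapper. The sketch below defaults to a dry run that only prints the commands; run and shrink_ext4 are hypothetical names, and the device and size are the ones from the example above:

```shell
# run: print the command when DRY_RUN is 1 (the default here),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# shrink_ext4 DEVICE SIZE: unmount, force a check, then shrink.
# The && chain stops as soon as one step reports a non-zero status.
shrink_ext4() {
    run umount "$1" &&
    run e2fsck -f "$1" &&
    run resize2fs "$1" "$2"
}

shrink_ext4 /dev/sda1 50G   # dry run: prints the three commands, changes nothing
```

Note that e2fsck also returns a non-zero status when it has corrected errors, so with DRY_RUN=0 the chain stops in that case as well. That is deliberately conservative: re-run the check until it comes back clean before shrinking.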
fsck
Only on an unmounted filesystem:
umount /mnt/data
e2fsck -f /dev/sda1 # -f forces it even on a "clean" filesystem
e2fsck -y /dev/sda1 # -y answers "yes to everything" (for scripts)
For the root partition there is errors=remount-ro in fstab. On a filesystem error
it remounts the volume read-only automatically. More in fsck-and-recovery.
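When e2fsck -y runs from a script, the exit status is the only result the script sees. It is a bitmask documented in the e2fsck man page; the sketch below decodes it into text. fsck_status is a hypothetical helper name:

```shell
# fsck_status CODE: translate an e2fsck exit status (a bitmask) into text.
fsck_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    for pair in "1:errors corrected" "2:reboot recommended" \
                "4:errors left uncorrected" "8:operational error" \
                "16:usage or syntax error" "32:cancelled by user" \
                "128:shared library error"; do
        bit=${pair%%:*}
        if [ $(( code & bit )) -ne 0 ]; then
            out="${out:+$out, }${pair#*:}"
        fi
    done
    echo "$out"
}

fsck_status 0   # prints: no errors
fsck_status 5   # prints: errors corrected, errors left uncorrected

# Typical use (not run here):
#   e2fsck -y /dev/sda1; fsck_status $?
```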
When something goes wrong
- "No space left" with free GB: you ran out of inodes (df -i). Delete small files or recreate the filesystem with -i 4096.
- "Read-only file system": errors=remount-ro triggered. Check dmesg | grep EXT4-fs for the cause. Often a bad sector.
- Files gone after a crash: data=writeback without an [[mount-and-fstab|fsync]] from the application. Lessons: fsync(), O_DSYNC for critical data.
- Very slow after mkfs on a large disk: lazy_itable_init=1 is still working in the background. The ext4lazyinit kernel thread is visible in ps or iotop.
- "tune2fs: Filesystem has unsupported feature(s)": an old distribution does not know the feature. Check dumpe2fs -h /dev/sdX | grep features and update e2fsprogs.
- 5% reserved blocks: on large disks use -m 1 or -m 0. The reserve matters mostly on a root filesystem, so that the system can keep running once it fills up.
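For the fsync lesson, a shell script can follow the same discipline as an application: write to a temporary file, flush it, rename it into place, then flush the directory. This is a common pattern rather than anything ext4-specific. The sketch assumes GNU coreutils sync, which accepts file arguments; safe_write is a hypothetical helper:

```shell
# safe_write TARGET: write stdin to TARGET so that a crash leaves either
# the old file or the complete new one, never a truncated mix.
# Assumption: `sync FILE` flushes that file (GNU coreutils 8.24 or later).
# Note: mktemp creates the file with mode 0600; adjust with chmod if needed.
safe_write() {
    target=$1
    dir=$(dirname "$target")
    tmp=$(mktemp "$dir/.tmp.XXXXXX") || return 1
    cat > "$tmp" &&
    sync "$tmp" &&            # flush the data before the file becomes visible
    mv "$tmp" "$target" &&    # rename within one directory is atomic
    sync "$dir"               # flush the directory entry for the rename
}

# Typical use:
#   generate_config | safe_write /etc/myapp/config   (names are placeholders)
```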
Checking the state
dumpe2fs -h /dev/sda1 # the superblock without group details
debugfs -R 'stat <inode>' /dev/sda1 # details for a specific inode
filefrag -v /path/to/file # fragmentation of a specific file
e4defrag /path # online defrag (rarely needed)
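The superblock dump is also the quickest way to check whether a particular feature is enabled, for example before moving a disk to a machine with an older kernel or e2fsprogs. A sketch that parses the Filesystem features line of dumpe2fs -h output; has_feature is a hypothetical helper:

```shell
# has_feature NAME: read `dumpe2fs -h` style output on stdin and succeed
# if NAME appears in the "Filesystem features:" line.
has_feature() {
    awk -v want="$1" -F: '
        /^Filesystem features:/ {
            n = split($2, f, " ")
            for (i = 1; i <= n; i++) if (f[i] == want) found = 1
        }
        END { exit !found }'
}

# Typical use (needs root, not run here):
#   dumpe2fs -h /dev/sda1 2>/dev/null | has_feature metadata_csum && echo enabled
```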