fsck and recovery: checking and repairing a filesystem

When something breaks

A filesystem can end up in a "dirty" state after:

Power loss / kernel panic during a write
Bad sectors on an HDD/SSD (the drive returned I/O errors during writes)
A disk controller that lied about fsync (write-back cache without a BBU)
Memory corruption (bad RAM, especially on machines without ECC)
A bug in the kernel or the filesystem
An accidental write to the raw block device, bypassing the filesystem (for example, dd to the wrong disk)

Symptoms:

Read-only file system errors: the kernel has remounted the filesystem read-only (errors=remount-ro)
Structure needs cleaning when mounting or accessing files
Input/output error when reading specific files
Odd ls behavior (garbage names, files disappearing)
The machine hangs at boot on "checking filesystems..."

Strategy:

Stop. Do not write to the damaged filesystem.
Image the raw device (dd or ddrescue) to another disk before attempting any repair.
If the filesystem has a journal, let it replay on mount.
If that does not help, do an offline fsck/repair.
If even that fails, extract the data with recovery utilities.

Journal replay, the automatic magic

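On mount, ext4 and XFS notice that the filesystem was not unmounted cleanly and replay the journal: they apply transactions that were already committed to the journal but not yet written to their final location. btrfs is copy-on-write and stays consistent without a journal, but it likewise replays its log tree (recent fsync data) on mount. Replay usually takes seconds and resolves most crash scenarios.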

What you will see in dmesg:

```
EXT4-fs (sda1): recovery required on readonly filesystem
EXT4-fs (sda1): write access will be enabled during recovery
EXT4-fs (sda1): mounted filesystem with ordered data mode

XFS (sdb1): Mounting V5 Filesystem
XFS (sdb1): Starting recovery (logdev: internal)
XFS (sdb1): Ending recovery (logdev: internal)
```

If journal recovery fails, then reach for fsck.

fsck, the frontend

fsck is a generic frontend: it detects the filesystem type and runs the matching fsck.<type> checker:

Filesystem	What actually runs
ext2/3/4	`e2fsck`
xfs	`fsck.xfs`, a no-op stub; use `xfs_repair` directly
btrfs	`fsck.btrfs`, also a no-op stub; use `btrfs check` directly
vfat	`fsck.fat` (also installed as `dosfsck`)
jfs	`jfs_fsck`

```bash
fsck -A           # everything in fstab with passno > 0 (what the boot-time check covers)
fsck /            # a specific mountpoint
fsck /dev/sda1    # a specific device
```

Options:

Option	What it does
`-A`	everything in fstab
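`-y`	answer "yes" to every question; unsafe if you do not understand the damage
`-n`	answer "no" to every question and change nothing (dry run)
`-f`	force a check even if the filesystem is marked clean
`-V`	verbose: print the checker commands fsck runs
`-C`	show a progress bar (supported by e2fsck)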

The main rule: the filesystem must be unmounted. Repairing a mounted filesystem that is still being written to will corrupt it. fsck -n on a mounted filesystem is safe for diagnostics, but expect spurious errors, because the filesystem keeps changing while it is being read.
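Before running a repair it is worth confirming that the device really is unmounted. A minimal sketch, assuming Linux's /proc/mounts; `is_mounted` and the `MOUNTS_FILE` override are hypothetical names for illustration (`findmnt --source /dev/sda1` from util-linux answers the same question more robustly):

```bash
# Hypothetical helper: succeed if the given device appears as a mount source.
# MOUNTS_FILE defaults to /proc/mounts; override it for testing.
# Limitation: LVM/dm devices may be listed under /dev/mapper names that do
# not match the resolved /dev/dm-N path.
is_mounted() {
  local dev
  # Resolve symlinks such as /dev/disk/by-uuid/... to the real device node.
  dev=$(readlink -f -- "$1")
  # Field 1 of each mounts line is the source device; exit 0 on a match.
  awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' \
    "${MOUNTS_FILE:-/proc/mounts}"
}

# Example guard before a repair (device name is illustrative):
# if is_mounted /dev/sda1; then echo "unmount /dev/sda1 first" >&2; exit 1; fi
```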

e2fsck, the ext family

```bash
umount /dev/sda1
e2fsck -f /dev/sda1                # force a full check, even if marked clean
e2fsck -fy /dev/sda1               # auto-yes (for scripts and desperate cases)
e2fsck -fp /dev/sda1               # preen: fix what is safe automatically, stop on the rest
```
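fsck and e2fsck report their result as a bit mask in the exit status (documented in fsck(8) and e2fsck(8)), so several conditions can be set at once and a script should test individual bits rather than compare the whole number. A small decoder sketch; `fsck_status` is a hypothetical name:

```bash
# Hypothetical helper: explain an fsck/e2fsck exit status, one line per set bit.
fsck_status() {
  local code=$1 bit
  if [ "$code" -eq 0 ]; then
    echo "0: no errors"
    return 0
  fi
  for bit in 1 2 4 8 16 32 128; do
    if [ $((code & bit)) -ne 0 ]; then
      case $bit in
        1)   echo "1: filesystem errors corrected" ;;
        2)   echo "2: errors corrected, system should be rebooted" ;;
        4)   echo "4: filesystem errors left uncorrected" ;;
        8)   echo "8: operational error" ;;
        16)  echo "16: usage or syntax error" ;;
        32)  echo "32: checking canceled by user request" ;;
        128) echo "128: shared-library error" ;;
      esac
    fi
  done
}
```

For example, `e2fsck -fp /dev/sda1; fsck_status $?` shows whether preen mode stopped with problems left for you (bit 4).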

Special cases:

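```bash
# Use a backup superblock if the primary one is broken
e2fsck -b 32768 /dev/sda1                    # first backup with 4K blocks (8193 with 1K blocks)
mke2fs -n /dev/sda1                          # -n = dry run: prints backup locations, writes nothing
                                             # (without -n this REFORMATS the device; the output is
                                             # only accurate with the options used at creation time)
dumpe2fs /dev/sda1 | grep -i superblock      # needs a readable primary superblock
```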
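Where the backups live follows from the layout: with the default sparse_super feature, copies sit at the start of block group 1 and of every group whose number is a power of 3, 5 or 7, and each group spans 8 × block size blocks. A sketch that computes candidates under exactly those assumptions (no sparse_super2, default blocks per group); `ext4_backup_sb` and `is_power_of` are hypothetical helpers, and mke2fs -n or dumpe2fs remain the authoritative sources:

```bash
# Succeed if $1 is a power of $2 (1 counts as the zeroth power).
is_power_of() {
  local n=$1 base=$2
  while [ $((n % base)) -eq 0 ]; do
    n=$((n / base))
  done
  [ "$n" -eq 1 ]
}

# Hypothetical helper: candidate backup superblock block numbers.
# Usage: ext4_backup_sb BLOCK_SIZE GROUP_COUNT
ext4_backup_sb() {
  local bs=$1 groups=$2
  local per_group=$((bs * 8))   # one bitmap block covers 8 * bs blocks
  local first=0 g=1 out=""
  # With 1K blocks, block 0 is reserved for boot data and group 0 starts at block 1.
  if [ "$bs" -eq 1024 ]; then first=1; fi
  while [ "$g" -lt "$groups" ]; do
    if is_power_of "$g" 3 || is_power_of "$g" 5 || is_power_of "$g" 7; then
      out="$out${out:+ }$((first + g * per_group))"
    fi
    g=$((g + 1))
  done
  echo "$out"
}
```

With 4K blocks the first candidates are 32768, 98304, 163840, 229376 and 294912, any of which can be passed to e2fsck -b.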

What it prints on corruption:

```
Pass 1: Checking inodes, blocks, and sizes
Inode 12345 has illegal block(s).  Clear<y>?
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
```

Files that lost their directory entry but still have a valid inode are reconnected in /lost+found/, named after their inode number (#12345). The content is intact; you have to work out the original names from the content.

xfs_repair, XFS

```bash
umount /dev/sdb1
xfs_repair /dev/sdb1                # repair
xfs_repair -n /dev/sdb1             # no-modify mode: report only (dry run)
xfs_repair -L /dev/sdb1             # ZERO THE LOG, last resort!
```

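-L zeroes the log, throwing away any metadata changes that were recorded in it but not yet written back. If xfs_repair refuses to run because the log is dirty, first mount and unmount the filesystem so that the kernel replays the log; use -L only if that mount fails.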

XFS does not create lost+found at mkfs time. xfs_repair moves orphaned files and directories into lost+found (creating it if needed), named by inode number.
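The order described above (check, let the kernel replay the log, repair, and only then consider -L) can be written down as a script. A sketch, assuming xfs_repair's documented no-modify exit status (0 when no corruption is found); `xfs_escalate` is a hypothetical name, and it deliberately never runs -L on its own:

```bash
# Hypothetical helper: the XFS escalation order, minus the destructive step.
# Usage: xfs_escalate DEVICE MOUNTPOINT   (filesystem must be unmounted)
xfs_escalate() {
  local dev=$1 mnt=$2
  # 1. No-modify check: exit status 0 means no corruption was found.
  if xfs_repair -n "$dev"; then
    echo "$dev: no corruption found"
    return 0
  fi
  # 2. Mount and unmount so the kernel replays a dirty log.
  if mount "$dev" "$mnt" && umount "$mnt"; then
    # 3. With the log replayed, a normal repair should be able to run.
    xfs_repair "$dev"
    return
  fi
  # 4. The log could not be replayed: stop and let a human decide on -L.
  echo "$dev: mount failed; image the device before trying xfs_repair -L" >&2
  return 1
}
```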

btrfs check / btrfs rescue

```bash
umount /dev/sdc1
btrfs check /dev/sdc1                  # read-only diagnostics (the default mode)
btrfs check --repair /dev/sdc1         # DANGEROUS, can make things worse!
```

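btrfs check --repair has been treated as experimental for years, and its documentation warns against using it unless a developer or an experienced user has advised it. Use it only when you have nothing to lose, and try btrfs rescue first: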

```bash
btrfs rescue super-recover /dev/sdc1   # restore damaged superblocks from good copies
btrfs rescue chunk-recover /dev/sdc1   # rebuild the chunk tree by scanning the device (slow)
btrfs rescue zero-log /dev/sdc1        # clear the log tree, similar to xfs_repair -L
```

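If the filesystem will not mount normally, try a read-only rescue mount:

```bash
mount -o ro,recovery,nologreplay /dev/sdc1 /mnt                      # before 5.9 (recovery = old name of usebackuproot)
mount -o ro,rescue=usebackuproot,rescue=nologreplay /dev/sdc1 /mnt   # kernel 5.9 and later
```

Then copy the data off with cp -a, or with btrfs send for existing read-only snapshots.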

fstab, passno

The sixth field in fstab (passno) controls whether, and in which order, fsck checks a filesystem at boot:

```
UUID=...  /     ext4  defaults  0 1
UUID=...  /var  ext4  defaults  0 2
UUID=...  /tmp  tmpfs defaults  0 0
```

0: do not check
1: root, checked first
2: checked after root; entries on different physical disks are checked in parallel

XFS and btrfs entries normally use passno 0: their fsck.* helpers are no-op stubs, and repairs are done by hand with their own tools.
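To see which entries a boot-time fsck would pick up, and in what order, the passno rules can be applied to an fstab with awk. A sketch; `fsck_order` is a hypothetical helper, and it ignores details such as per-disk parallelism:

```bash
# Hypothetical helper: list fstab entries with passno > 0, in check order.
# Prints "passno mountpoint fstype"; comments and passno 0 (or missing) are skipped.
fsck_order() {
  awk 'NF >= 6 && $1 !~ /^#/ && $6 > 0 { print $6, $2, $3 }' "${1:-/etc/fstab}" |
    sort -n -k1,1
}
```

Run without arguments it reads /etc/fstab; an XFS or btrfs line that shows up here has a nonzero passno it probably does not need.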

errors=remount-ro

A mount option, set in fstab:

```
UUID=...  /  ext4  defaults,errors=remount-ro  0 1
```

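When the kernel detects a filesystem error, it remounts the filesystem read-only instead of continuing to write to a possibly inconsistent structure. This protects the data: nothing more gets written, and you can still read dmesg, unmount (or boot from rescue media), and run fsck.

Alternatives: errors=continue (log the error and carry on; dangerous) and errors=panic (panic the kernel; useful where a reboot or failover is better than running degraded, e.g. embedded systems or HA clusters).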
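A quick way to check that every ext filesystem in fstab carries an errors= option is an awk filter over the options field. A sketch; `missing_errors_opt` is a hypothetical helper. Note that ext4 also stores a default in the superblock (shown as "Errors behavior" by tune2fs -l, changed with tune2fs -e), which applies when fstab says nothing:

```bash
# Hypothetical helper: print ext2/3/4 mountpoints in fstab that lack errors=.
# A comma is prepended to the options field so ",errors=" also matches
# when errors= is the first option.
missing_errors_opt() {
  awk 'NF >= 4 && $1 !~ /^#/ && $3 ~ /^ext[234]$/ && ("," $4) !~ /,errors=/ { print $2 }' \
    "${1:-/etc/fstab}"
}
```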

ddrescue for a dying disk

When an HDD is failing and a plain dd chokes on bad sectors:

```bash
ddrescue /dev/sda backup.img backup.log
```

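ddrescue copies everything it can read into the image, skips bad sectors instead of hanging on them, and records its progress in the third file (a mapfile, called a logfile in older releases). Running the same command again retries only the areas that are still missing. Keep the image on a different, healthy disk. Before any xfs_repair -L or btrfs check --repair, make a ddrescue copy first.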
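A common way to use the mapfile is two passes: first grab everything that reads easily, then go back for the difficult areas. A sketch, assuming GNU ddrescue's -n (skip the slow scraping phase), -d (direct disc access) and -r (retry passes) options; `rescue_image` is a hypothetical wrapper:

```bash
# Hypothetical wrapper: image a failing device in two passes with GNU ddrescue.
# Usage: rescue_image SOURCE_DEVICE IMAGE_FILE MAPFILE
rescue_image() {
  local src=$1 img=$2 map=$3
  # Pass 1: -n skips the slow scraping phase, so the easily readable data
  # is saved first while the disk is still working.
  ddrescue -n "$src" "$img" "$map" || return
  # Pass 2: the mapfile tells ddrescue what is still missing; retry those
  # areas with direct disc access (-d) and up to 3 retry passes (-r3).
  ddrescue -d -r3 "$src" "$img" "$map"
}
```

For example, `rescue_image /dev/sda /mnt/spare/sda.img /mnt/spare/sda.map`. Afterwards, run the repair on a copy of the image rather than on the dying disk: e2fsck accepts an image file directly, and xfs_repair does with -f.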

Recovering a file

If a file was deleted but its inode was not overwritten:

ext4: extundelete /dev/sda1 --restore-file path/to/file (e2undel only handles ext2). Unmount the filesystem before you run it.
xfs: xfs_undelete (a community tool; it does not always work).
btrfs: copy the file from a snapshot, if one exists; btrfs restore can also pull files off a filesystem that will not mount.
Any FS: photorec from the testdisk package, which carves files by their signatures (file names are lost).

The longer the filesystem stays mounted read-write after the deletion, the lower your chances. Unmount it right away.

Best practices

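Backups: with a good backup, a failed fsck is an inconvenience rather than a disaster
errors=remount-ro in fstab, always
smartctl -a shows S.M.A.R.T. data, and smartd can monitor it and often (not always) warn before a disk fails
Scrub btrfs/zfs (btrfs scrub, zpool scrub) about once a week
The boot-time preen check (fsck -p on passno > 0 entries) catches small issues early
ddrescue before a repair whenever you suspect the hardware
Document your partitions' UUIDs and backup superblock locations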
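On ATA/SATA disks, three S.M.A.R.T. attributes are common early warnings when their raw value rises above zero: 5 (Reallocated_Sector_Ct), 197 (Current_Pending_Sector) and 198 (Offline_Uncorrectable). A sketch of a filter over `smartctl -A` output; `smart_red_flags` is a hypothetical name, attribute names can vary by vendor, and NVMe drives report health in a different format, so this does not apply to them:

```bash
# Hypothetical filter: read `smartctl -A` output on stdin and print the
# warning attributes whose raw value is greater than zero, as NAME=VALUE.
smart_red_flags() {
  # In the ATA attribute table, field 1 is the ID, field 2 the name and
  # field 10 the raw value; the header line never matches the numeric IDs.
  awk '($1 == 5 || $1 == 197 || $1 == 198) && $10 + 0 > 0 { print $2 "=" $10 }'
}
```

Usage: `smartctl -A /dev/sda | smart_red_flags`. Any output is a good reason to run ddrescue before attempting a repair.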

When something goes wrong

fsck.xfs does nothing: this is normal, fsck.xfs is a stub. Use xfs_repair.
fsck seems to hang: e2fsck or xfs_repair on a huge, badly damaged filesystem can take hours. Do not interrupt it; with e2fsck, -C 0 shows progress.
Bad superblock: point e2fsck at an ext4 backup superblock with e2fsck -b $BACKUP /dev/sda1.
UUID conflict: after dd the copy has the same UUID as the original. Give it a new one with tune2fs -U random /dev/sdb1 (ext4) or xfs_admin -U generate /dev/sdb1 (XFS).
lost+found is empty after fsck: nothing had to be reconnected as an orphan, or the filesystem was healthy.
The system will not boot because fsck failed: boot a rescue environment (init=/bin/sh, a live USB) and run e2fsck -fy /dev/sdaN by hand.
