Why btrfs
btrfs was merged into the mainline kernel in 2009 as Linux's answer to ZFS. It gave Linux:
- Copy-on-write (COW): a write does not modify the block in place; it writes to a new location and updates the pointer
- Subvolume: a named subdirectory you can mount, snapshot, and send on its own
- O(1) snapshots: a cheap copy of a subvolume through clever COW
- Data + metadata checksums: you see bit rot before it corrupts a file
- Native RAID 0/1/10: without mdraid on top
- Send/receive: incremental replication
It is the default on openSUSE Tumbleweed and Fedora Workstation 33+. On production servers it shows up less often because of the historical instability of RAID 5/6 and COW fragmentation.
Copy-on-write, the idea
When a block changes:
- Non-COW (ext4, xfs): write to the same block
- COW (btrfs, ZFS): write to a new block, update the pointer in the tree
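The pointer update can be illustrated with a toy model. This is only a sketch in plain shell, not how btrfs is implemented: files in a temporary directory stand in for disk blocks, and the names `cow_write`, `take_snapshot`, and `read_block` are invented for the illustration.

```shell
# Toy copy-on-write model: every write goes to a fresh block,
# and a "file" is nothing more than a pointer to a block number.
store=$(mktemp -d)   # stands in for the disk
next=0               # next free block number
live=                # block the live file points at
snap=                # block the snapshot points at

cow_write() {
    printf '%s\n' "$1" > "$store/$next"   # data lands in a new block
    live=$next                            # then the pointer moves
    next=$((next + 1))
}

take_snapshot() { snap=$live; }           # copies the pointer, never the data

read_block() { cat "$store/$1"; }

cow_write "version 1"
take_snapshot
cow_write "version 2"

echo "live: $(read_block "$live")"   # live: version 2
echo "snap: $(read_block "$snap")"   # snap: version 1
```

The old block is never overwritten, which is why the snapshot still reads the first version and why that block cannot be freed while the snapshot exists.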
Upsides:
- O(1) snapshot: just a new name that points at the same blocks
- Crash-safe without a journal: there is always a consistent point
- Checksums are verified automatically on read
Downsides:
- Fragmentation: frequent edits to one file scatter it across the disk. This hits databases (page-level edits) and VM images especially hard.
- Space usage is hard to predict: changing a block in a file consumes extra space whenever a snapshot still holds the old version of that block.
The fix for databases and VMs is chattr +C (disable COW for specific files):
mkdir /data/postgres
chattr +C /data/postgres # for all new files inside
cp -a /old/postgres/* /data/postgres/
chattr +C takes effect only when a file is created; existing files keep COW until they are recopied into a directory that carries the attribute.
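To confirm that the attribute is in place, `lsattr` prints a `C` in its flag column for files and directories with COW disabled. A small sketch; the helper names `has_nocow` and `check_nocow` are made up for this example.

```shell
# has_nocow FLAGS: succeed if an lsattr flag string contains C (No_COW).
has_nocow() {
    case "$1" in
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

# check_nocow PATH: read the flag column for PATH itself
# (-d: report on the directory, not its contents).
check_nocow() {
    flags=$(lsattr -d "$1" | awk '{print $1}')
    has_nocow "$flags"
}

# Usage, with the directory from the example above:
#   check_nocow /data/postgres && echo "COW disabled" || echo "COW still on"
```

Running the check on a few of the copied files as well shows whether they inherited the attribute from the directory.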
Subvolume
A subvolume is a separate tree inside one filesystem. It is not a partition; it is a logical unit of btrfs:
mkfs.btrfs /dev/sdb1
mount /dev/sdb1 /mnt
btrfs subvolume create /mnt/@home
btrfs subvolume create /mnt/@var
btrfs subvolume list /mnt
Advantages:
- You can mount a subvolume on its own (mount -o subvol=@home)
- Its own quotas, its own snapshots
- Isolated ENOSPC boundaries (optional, through quotas)
A common layout:
/mnt/ ← top-level
├─ @ ← root /
├─ @home ← /home
├─ @var ← /var
└─ @snapshots ← /.snapshots
In fstab:
UUID=... / btrfs defaults,subvol=@,compress=zstd:1 0 0
UUID=... /home btrfs defaults,subvol=@home,compress=zstd:1 0 0
Snapshots
# Snapshot of a subvolume
btrfs subvolume snapshot /mnt/@home /mnt/@snapshots/home-$(date +%F)
# Read-only snapshot (for backups through send)
btrfs subvolume snapshot -r /mnt/@home /mnt/@snapshots/home-ro
# Rollback: move the broken subvolume aside, restore from a snapshot, then clean up
mv /mnt/@home /mnt/@home_broken
btrfs subvolume snapshot /mnt/@snapshots/home-2026-05-01 /mnt/@home
btrfs subvolume delete /mnt/@home_broken
A snapshot is instant and takes 0 space at first. As edits accumulate, the original and the snapshot diverge, and space is spent on the difference.
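Because every retained snapshot pins the blocks it still shares with the past, some retention routine is usually needed. The following is a minimal sketch, assuming snapshot names end in an ISO date as in the examples above; the function name `snapshots_to_prune` is hypothetical.

```shell
# snapshots_to_prune KEEP: read snapshot names on stdin (one per line) and
# print every name except the KEEP newest. ISO dates sort correctly as text.
snapshots_to_prune() {
    keep=$1
    sort | awk -v keep="$keep" '
        { names[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print names[i] }
    '
}

# Usage (review the list before deleting anything):
#   ls -d /mnt/@snapshots/home-[0-9]* | snapshots_to_prune 7 |
#       while IFS= read -r snap; do btrfs subvolume delete "$snap"; done
```

The glob `home-[0-9]*` deliberately skips names such as home-ro, which would otherwise sort after every dated snapshot and be treated as the newest.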
Send / receive
Incremental replication to another btrfs filesystem:
# Full send of the first snapshot (send only accepts read-only snapshots, created with -r)
btrfs send /mnt/@snapshots/home-2026-05-01 | ssh remote 'btrfs receive /backup/'
# Incremental
btrfs send -p /mnt/@snapshots/home-2026-05-01 /mnt/@snapshots/home-2026-05-02 \
| ssh remote 'btrfs receive /backup/'
It sends only the delta between snapshots, so backups are fast and compact. For a large dataset this is usually a better choice than rsync, which has to scan and compare the whole tree on every run.
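Choosing between a full and an incremental send can be scripted. A sketch under stated assumptions: snapshot paths contain no whitespace, their names sort chronologically, and the previous snapshot already exists on the receiving side (an incremental stream cannot be applied without its parent). The helper name `send_args` is invented here.

```shell
# send_args: read read-only snapshot paths on stdin and print the arguments
# for `btrfs send` that transfer the newest one: a full send if it is the
# only snapshot, an incremental send against the previous one otherwise.
send_args() {
    sort | tail -n 2 | {
        read -r first || return 1          # no snapshots at all
        if read -r second; then
            printf '%s %s %s\n' -p "$first" "$second"
        else
            printf '%s\n' "$first"
        fi
    }
}

# Usage (the unquoted substitution is intentional, it splits into arguments):
#   btrfs send $(ls -d /mnt/@snapshots/home-[0-9]* | send_args) \
#       | ssh remote 'btrfs receive /backup/'
```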
RAID
btrfs applies RAID per chunk (block group), not per whole disk, so data and metadata can use different profiles:
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc # data + metadata mirror
mkfs.btrfs -d raid10 -m raid1 /dev/sd{b,c,d,e} # data RAID10, meta mirror
mkfs.btrfs -d single -m raid1 /dev/sdb /dev/sdc # data not mirrored, metadata is
Levels:
- single: no redundancy
- dup: two copies on one disk (the default for metadata on a single disk)
- raid0: stripe with no redundancy
- raid1: mirror with exactly 2 copies of each block, spread across however many disks the filesystem has (unlike mdraid RAID1, where every disk holds a full copy)
- raid1c3, raid1c4: 3 and 4 copies of each block
- raid10: stripe of mirrors
- raid5, raid6: historically unstable because of the write hole, and widely avoided until about 2024. With newer kernels (6.x) the situation has improved, but multi-disk production setups still run more often on mdraid + ext4/xfs.
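The practical consequence of each level is how much of the raw capacity stays usable. A rough calculator, assuming equal-size disks and ignoring metadata overhead; the function name `usable_gib` is hypothetical and the numbers are estimates, not what btrfs itself reports.

```shell
# usable_gib PROFILE DISKS SIZE_GIB: approximate usable data capacity for
# DISKS devices of SIZE_GIB each. Unequal devices are not handled.
usable_gib() {
    profile=$1
    disks=$2
    size=$3
    raw=$((disks * size))
    case "$profile" in
        single|raid0)     echo "$raw" ;;                      # no redundancy
        dup|raid1|raid10) echo $((raw / 2)) ;;                # 2 copies
        raid1c3)          echo $((raw / 3)) ;;                # 3 copies
        raid1c4)          echo $((raw / 4)) ;;                # 4 copies
        raid5)            echo $(((disks - 1) * size)) ;;     # 1 disk of parity
        raid6)            echo $(((disks - 2) * size)) ;;     # 2 disks of parity
        *)                echo "unknown profile: $profile" >&2; return 1 ;;
    esac
}

usable_gib raid1 3 4000     # 6000
usable_gib raid1c3 3 4000   # 4000
```

Three 4000 GiB disks in raid1 give about 6000 GiB, not 4000, because each block has only two copies no matter how many disks are present.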
Add or remove a disk online:
btrfs device add /dev/sdd /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
btrfs device remove /dev/sdc /mnt
Checksums and scrub
btrfs stores a crc32c (or blake2/sha256/xxhash) checksum per block and verifies it on every read. If the checksum does not match and a redundant copy exists (dup or a RAID profile), it reads the good copy and repairs the bad one on the fly.
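The verify-then-repair read path can be mimicked with ordinary files. This is a toy model only, not btrfs internals: POSIX `cksum` stands in for crc32c, two files stand in for the two mirrored copies, and `checksum` and `verified_read` are names invented for the sketch.

```shell
# Toy model of a checksum-verified read with one mirror copy.
dir=$(mktemp -d)
copy_a=$dir/copy_a
copy_b=$dir/copy_b

checksum() { cksum < "$1" | cut -d ' ' -f 1; }

printf 'important data\n' > "$copy_a"
cp "$copy_a" "$copy_b"
stored=$(checksum "$copy_a")          # recorded at write time

# verified_read: serve copy_a if it verifies; otherwise fall back to
# copy_b and repair copy_a from it. Fail if neither copy verifies.
verified_read() {
    if [ "$(checksum "$copy_a")" = "$stored" ]; then
        cat "$copy_a"
    elif [ "$(checksum "$copy_b")" = "$stored" ]; then
        cp "$copy_b" "$copy_a"        # self-heal the damaged copy
        cat "$copy_a"
    else
        echo "both copies failed verification" >&2
        return 1
    fi
}

printf 'important dat4\n' > "$copy_a"   # simulate bit rot on one device
verified_read                           # prints: important data
```

Without the second copy the read can only fail, which matches the single-profile case: corruption is detected but cannot be repaired.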
A preventive check:
btrfs scrub start /mnt
btrfs scrub status /mnt
Scrub walks the whole filesystem and verifies every block. On SSD/NVMe it runs in the background without much impact; on HDDs the extra I/O is noticeable. Running it once a week or once a month from cron is normal.
Compression
... compress=zstd:1 ...
Transparent compression. Algorithms: zlib, lzo, zstd:1..15.
zstd:1 is fast and saves about 30% on typical text data.
Files that are already written can be recompressed:
btrfs filesystem defragment -r -czstd /mnt/data
When something goes wrong
- "No space left" with free GB: btrfs allocates chunks separately for data and metadata, and the metadata ran out. Running btrfs balance start -dusage=50 /mnt repacks half-empty data chunks and returns the freed space to the unallocated pool, where metadata can claim it. On a full filesystem even balance may not start, so delete at least something first.
- The database is slow and fragments: COW was not turned off. Run chattr +C on the directory BEFORE the database files are created.
- Snapshots ate all the space: delete old ones with btrfs subvolume delete /mnt/@snapshots/old-*. The space comes back in the background once the deleted subvolumes have been cleaned up; a balance afterwards compacts the freed chunks.
- "parent transid verify failed": corruption. Use btrfs check (diagnostics only); for repair there is btrfs check --repair (RISKY!), but it is better to mount with -o ro,rescue=usebackuproot and rescue the data (older kernels: -o ro,usebackuproot; recovery is the deprecated original name of this option).
- RAID 5/6 lost data after a crash: the known write-hole problem. Use RAID 1/10 for multi-disk.
- The disk is 90% full and slow: btrfs does not like being near full. Keep it under 80%.
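The 80% guideline is easy to monitor from a script. A sketch, assuming a mount point at /mnt; the helper names `usage_pct` and `over_limit` are hypothetical. Note that `df` only gives an estimate on btrfs, and `btrfs filesystem usage /mnt` shows the real split between data, metadata, and unallocated space.

```shell
# usage_pct MOUNTPOINT: print the used percentage that df reports, digits only.
usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_limit PCT LIMIT: succeed when PCT is at or above LIMIT.
over_limit() {
    [ "$1" -ge "$2" ]
}

# Usage (80 is the threshold from the list above):
#   over_limit "$(usage_pct /mnt)" 80 && echo "btrfs at /mnt is filling up" >&2
```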
When to choose btrfs
| Use case | Btrfs? |
|---|---|
| Workstation with auto-snapshots | ✓ (openSUSE/Snapper integration) |
| NAS with deduplication | ✗ (take ZFS instead) |
| NAS with snapshots and RAID 1/10 | ✓ |
| Database server | ✗ (ext4/xfs + LVM snapshot) |
| VM host with qcow2 | ✗ or with chattr +C |
| Container host | ✓ (overlayfs on top, or native btrfs snapshotting) |
| RAID 5/6 needed in production | ✗ (mdraid + ext4/xfs) |