btrfs: copy-on-write, subvolumes, and snapshots

Why btrfs

Btrfs was merged into the mainline Linux kernel in 2009 (2.6.29) as an answer to ZFS. It gave Linux:

Copy-on-write (COW): a write does not modify the block in place; it writes to a new location and updates the pointer
Subvolume: a named subdirectory you can mount, snapshot, and send on its own
O(1) snapshots: a snapshot shares all blocks with its source subvolume, so creating one costs almost nothing
Data + metadata checksums: bit rot is detected on read instead of being returned silently as good data
Native RAID 0/1/10: without mdraid on top
Send/receive: incremental replication

It is the default on openSUSE Tumbleweed and Fedora Workstation 33+. On production servers it shows up less often because of the historical instability of RAID 5/6 and COW fragmentation.

Copy-on-write, the idea

When a block changes:

Non-COW (ext4, xfs): write to the same block
COW (btrfs, ZFS): write to a new block, update the pointer in the tree

Upsides:

O(1) snapshot: just a new name that points at the same blocks
Crash-safe without a journal: the old tree stays valid until the new root is committed, so there is always a consistent state on disk
Checksums are verified automatically on read

Downsides:

Fragmentation: frequent edits to one file scatter it across the disk. This hits databases (page-level edits) and VM images especially hard.
Space is unpredictable: overwriting a block can increase disk usage, because the old version stays allocated for as long as a snapshot references it.

The fix for databases and VMs is chattr +C (disable COW for specific files):

bash

mkdir /data/postgres

chattr +C /data/postgres        # for all new files inside

cp -a /old/postgres/. /data/postgres/   # '/.' also copies dotfiles, which '*' would skip

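chattr +C takes effect only for files created after the flag is set on the directory (or for files that are still empty); existing files keep COW until they are recopied. Files without COW also lose data checksums and compression, so bit rot in them is no longer detected.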
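To confirm that the flag actually landed on the files, the flags column of lsattr can be checked for the letter C. The following is a minimal sketch, not a btrfs-progs feature: has_nocow_flag and list_cow_files are hypothetical helper names, and the only assumption about lsattr is its usual "flags, space, path" line layout.

```shell
# Succeeds if one line of `lsattr` output ("<flags> <path>") carries No_COW (C).
has_nocow_flag() {
    case ${1%% *} in          # keep only the flags column, ignore the path
        *C*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Prints every regular file under directory $1 that is still copy-on-write.
list_cow_files() {
    find "$1" -type f -exec lsattr {} + |
    while IFS= read -r line; do
        has_nocow_flag "$line" || printf '%s\n' "${line#* }"
    done
}
```

Running list_cow_files /data/postgres after the copy should print nothing; any path it prints was created before the flag was set and needs to be recopied.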

Subvolume

A subvolume is a separate tree inside one filesystem. It is not a partition; it is a logical unit of btrfs:

bash

mkfs.btrfs /dev/sdb1

mount /dev/sdb1 /mnt

btrfs subvolume create /mnt/@home

btrfs subvolume create /mnt/@var

btrfs subvolume list /mnt

Advantages:

You can mount a subvolume on its own (mount -o subvol=@home)
Its own quotas (qgroups) and its own snapshots
Isolated ENOSPC boundaries: with a quota set, one subvolume running out of space does not fill the others (optional)
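The quota-based boundary mentioned above can be sketched as follows. This is an illustration under assumptions: limit_subvolume is a hypothetical wrapper name, and the mount point, subvolume name, and size are placeholders. It relies only on btrfs quota enable and btrfs qgroup limit.

```shell
# limit_subvolume <mountpoint> <subvolume> <size>
# Enables quota accounting on the filesystem and caps one subvolume.
limit_subvolume() (
    mnt=$1 subvol=$2 size=$3
    btrfs quota enable "$mnt" || exit 1        # qgroups must be on before limits apply
    btrfs qgroup limit "$size" "$mnt/$subvol"  # limit the subvolume's own qgroup
)
```

For example, limit_subvolume /mnt @home 50G would cap @home at 50 GiB. Quota accounting adds overhead, which tends to grow with the number of snapshots, so it is worth enabling only where the limit is really needed.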

A common layout:

/mnt/             ← top-level

 ├─ @            ← root /

 ├─ @home        ← /home

 ├─ @var         ← /var

 └─ @snapshots   ← /.snapshots

In fstab:

fstab

UUID=...  /     btrfs  defaults,subvol=@,compress=zstd:1     0 0

UUID=...  /home btrfs  defaults,subvol=@home,compress=zstd:1 0 0

Snapshots

bash

# Snapshot of a subvolume

btrfs subvolume snapshot /mnt/@home /mnt/@snapshots/home-$(date +%F)

# Read-only snapshot (for backups through send)

btrfs subvolume snapshot -r /mnt/@home /mnt/@snapshots/home-ro

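# Rollback: move the broken subvolume aside, then snapshot the good state into its place

mv /mnt/@home /mnt/@home_broken

btrfs subvolume snapshot /mnt/@snapshots/home-2026-05-01 /mnt/@home

# Remount /home (or reboot) to switch to the restored subvolume, then clean up

btrfs subvolume delete /mnt/@home_broken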

A snapshot is created instantly and takes almost no extra space at first, because it shares every block with its source. As edits accumulate, the original and the snapshot diverge, and space is spent only on the difference.
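Because diverging snapshots keep old blocks allocated, dated snapshots are usually rotated. The sketch below assumes the naming scheme from the example above (prefix plus date +%F, which sorts lexically in date order); snapshots_to_prune and prune_snapshots are hypothetical helper names, not part of btrfs-progs.

```shell
# Reads snapshot names on stdin (one per line) and prints the ones to delete,
# keeping the newest $1. ISO dates sort correctly as plain strings.
snapshots_to_prune() {
    sort | awk -v keep="$1" '
        { name[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print name[i] }'
}

# prune_snapshots <snapshot-dir> <prefix> <keep>
# Deletes all but the newest <keep> snapshots named <prefix>-YYYY-MM-DD.
prune_snapshots() (
    dir=$1 prefix=$2 keep=$3
    for path in "$dir/$prefix"-*; do
        [ -d "$path" ] && basename "$path"
    done | snapshots_to_prune "$keep" | while IFS= read -r name; do
        btrfs subvolume delete "$dir/$name"
    done
)
```

A daily job could call prune_snapshots /mnt/@snapshots home 14 right after taking the new snapshot. Keeping the selection logic in a separate function makes it easy to dry-run with a list of names before anything is deleted.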

Send / receive

Incremental replication to another btrfs filesystem:

bash

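# Full send of the first snapshot (send only accepts read-only snapshots, created with -r)

btrfs send /mnt/@snapshots/home-2026-05-01 | ssh remote 'btrfs receive /backup/'

# Incremental: -p names the parent, which must already exist on the receiving side

btrfs send -p /mnt/@snapshots/home-2026-05-01 /mnt/@snapshots/home-2026-05-02 \
  | ssh remote 'btrfs receive /backup/'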

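Only the delta between the two snapshots is sent, so backups are fast and compact. Unlike rsync, send does not walk and compare both trees; it derives the changes from btrfs metadata, which makes it the better choice for a large dataset.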
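The full and incremental cases differ only in the -p argument, so a backup script can fold them into one function. This is a sketch under assumptions: replicate_snapshot is a hypothetical name, the host and paths are placeholders, and the remote directory is assumed not to contain a single quote.

```shell
# replicate_snapshot <snapshot> <parent-or-empty> <ssh-host> <remote-dir>
# Sends <snapshot> in full when <parent> is empty, incrementally otherwise.
replicate_snapshot() (
    snap=$1 parent=$2 host=$3 dest=$4
    if [ -n "$parent" ]; then
        btrfs send -p "$parent" "$snap"    # delta against the shared parent
    else
        btrfs send "$snap"                 # first run: the whole snapshot
    fi | ssh "$host" "btrfs receive '$dest'"
)
```

Called as replicate_snapshot /mnt/@snapshots/home-2026-05-02 /mnt/@snapshots/home-2026-05-01 remote /backup/, it reproduces the incremental command above. The function's exit status is that of the ssh side of the pipe, so a failure of btrfs send alone may go unnoticed unless the script also checks the result on the receiver.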

RAID

btrfs applies RAID profiles per chunk (block group), not per whole disk, so data and metadata can use different profiles:

bash

mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc      # data + metadata mirror

mkfs.btrfs -d raid10 -m raid1 /dev/sd{b,c,d,e}      # data RAID10, meta mirror

mkfs.btrfs -d single -m raid1 /dev/sdb /dev/sdc     # data not mirrored, metadata is

Levels:

single: no redundancy
dup: two copies on one disk (the default for metadata on a single disk)
raid0: stripe with no redundancy
raid1: mirror with exactly 2 copies of each chunk, however many disks there are (unlike mdraid, where RAID 1 across N disks keeps N copies)
raid1c3, raid1c4: 3-way and 4-way mirror
raid10: stripe of mirrors
raid5, raid6: historically unstable because of the write hole, and widely avoided until around 2024. Newer kernels (6.x) have improved the situation, but the write hole is not fully closed, and multi-disk production setups still run more often on mdraid + ext4/xfs.
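Since raid1 keeps two copies of each chunk on two different disks, usable space with unequal disks is not simply half the total: it is the smaller of "half of everything" and "everything except the largest disk". A rough estimate, ignoring metadata overhead and chunk granularity, can be sketched like this (raid1_usable is a hypothetical helper name; sizes are whole numbers in any one unit):

```shell
# raid1_usable <disk-size>...
# Prints the approximate usable capacity of a btrfs raid1 over the given disks.
raid1_usable() (
    total=0 largest=0
    for size in "$@"; do
        total=$((total + size))
        if [ "$size" -gt "$largest" ]; then largest=$size; fi
    done
    half=$((total / 2))          # every chunk is stored twice
    rest=$((total - largest))    # the largest disk needs partners for its copies
    if [ "$rest" -lt "$half" ]; then echo "$rest"; else echo "$half"; fi
)
```

For instance, raid1_usable 4 4 4 prints 6, while raid1_usable 8 2 2 prints 4: the 8 TB disk can only mirror against the 4 TB available on the other two.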

Add or remove a disk online:

bash

btrfs device add /dev/sdd /mnt

btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

btrfs device remove /dev/sdc /mnt

Checksums and scrub

btrfs stores a checksum for every block: crc32c by default, or xxhash, sha256, or blake2 if chosen at mkfs time. Every read verifies it. On a mismatch, if a redundant copy exists (raid1/10 or dup), btrfs reads the good copy and rewrites the bad one on the fly.

A preventive check:

bash

btrfs scrub start /mnt

btrfs scrub status /mnt

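Scrub reads the whole filesystem and verifies every block against its checksum. It runs in the background; on fast storage (SAS SSD, NVMe) the impact is small, while on HDDs the extra I/O is noticeable. Running it once a week or once a month from cron is normal.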
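For a cron job it helps to run scrub in the foreground so that the exit status is available. A minimal sketch, assuming a hypothetical wrapper name scrub_and_log and a placeholder mount point; -B is the btrfs scrub start option that waits for completion:

```shell
# scrub_and_log <mountpoint>
# Runs a foreground scrub and reports the outcome on stdout/stderr.
scrub_and_log() (
    mnt=$1
    # -B: do not background, so a non-zero status (scrub could not run,
    # or uncorrectable errors were found) reaches the caller
    if btrfs scrub start -B "$mnt"; then
        echo "scrub ok: $mnt"
    else
        echo "scrub reported problems on $mnt" >&2
        exit 1
    fi
)
```

Errors that scrub was able to repair from a redundant copy do not make it fail, so btrfs scrub status is still worth reading after a run.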

Compression

fstab

... compress=zstd:1 ...

Transparent compression. Algorithms: zlib (levels 1..9), lzo (no levels), zstd (levels 1..15). zstd:1 is fast and typically saves around 30% on text-like data; the actual ratio depends on the data.
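A typo in the compress= value makes the mount fail, so a provisioning script may want to check the value before writing it to fstab. The sketch below encodes only the ranges listed above; valid_compress_opt is a hypothetical helper, and newer kernels may accept values it rejects.

```shell
# Succeeds if $1 is an algorithm[:level] value from the ranges listed above.
valid_compress_opt() {
    case $1 in
        lzo|zlib|zstd)           return 0 ;;  # algorithm with its default level
        zlib:[1-9])              return 0 ;;
        zstd:[1-9]|zstd:1[0-5])  return 0 ;;
        *)                       return 1 ;;  # e.g. lzo:3, zstd:16
    esac
}
```

Usage: valid_compress_opt zstd:1 && echo "compress=zstd:1".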

Files that are already written can be recompressed. Defragmenting breaks extent sharing with snapshots and reflinked copies, so space usage can grow:

bash

btrfs filesystem defragment -r -czstd /mnt/data

When something goes wrong

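No space left with free GB: btrfs allocates chunks separately for data and metadata; the metadata ran out while all raw space was already handed to data chunks. btrfs balance start -dusage=50 /mnt repacks data chunks that are at most half full and returns the freed chunks to the unallocated pool, where new metadata chunks can be created. On a full filesystem even balance may not start, so delete at least something first.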
The database is slow and fragments: COW was not turned off. Run chattr +C on the directory BEFORE the database files are created.
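Snapshots ate all the space: delete old ones with btrfs subvolume delete /mnt/@snapshots/old-*. Space is freed asynchronously by a background cleaner; btrfs subvolume sync /mnt waits until it has finished.
parent transid verify failed: metadata corruption. Run btrfs check first (read-only diagnostics). btrfs check --repair is RISKY and a last resort; it is better to mount read-only with mount -o ro,rescue=usebackuproot (usebackuproot or recovery on older kernels) and rescue the data.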
RAID 5/6 lost data after a crash: the known write-hole problem. Use RAID 1/10 for multi-disk.
The disk is 90% full and slow: btrfs does not like being near full. Keep it under 80%.
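For the first case in the list, a balance with a high usage filter may itself fail for lack of working space. A common workaround is to raise the filter in steps, so that empty and nearly empty chunks are released first. A sketch under assumptions: stepwise_balance is a hypothetical name and the percentages are an arbitrary choice.

```shell
# stepwise_balance <mountpoint>
# Repacks data chunks starting with the emptiest ones; stops at the first failure.
stepwise_balance() (
    mnt=$1
    for pct in 0 5 10 25 50; do
        # -dusage=N touches only data chunks that are at most N% full
        btrfs balance start -dusage="$pct" "$mnt" || exit 1
    done
)
```

After stepwise_balance /mnt, btrfs filesystem usage /mnt should show more unallocated space than before.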

When to choose btrfs

Use case	Btrfs?
Workstation with auto-snapshots	✓ (openSUSE/Snapper integration)
NAS with inline deduplication	✗ (take ZFS instead)
NAS with snapshots and RAID 1/10	✓
Database server	✗ (ext4/xfs + LVM snapshot)
VM host with qcow2	✗ or with `chattr +C`
Container host	✓ (native overlayfs or snapshotting)
RAID 5/6 needed in production	✗ (mdraid + ext4/xfs)
