XFS: extents and parallel I/O

Why XFS

XFS was born at SGI in 1993 for IRIX and large storage. Its defining traits, compared with ext4 (which has since adopted extents and delayed allocation itself):

Allocation groups: the filesystem is split into N independent regions, and different processes write to different AGs in parallel
Extent-based allocation: a contiguous run of blocks is described by a single extent (start, length) rather than a list of block numbers
Dynamic inode allocation: inodes are not fixed at mkfs time
Delayed allocation: blocks are allocated at flush, not at write
Journals metadata only, and does it aggressively
Online grow is available, shrink is not

From RHEL 7 (2014) it is the default. It fits databases (PostgreSQL, MySQL), file servers, kvm/qemu disks, and mail spools with large mboxes.

Allocation groups: the key to parallelism

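The filesystem is divided into N equal-sized regions. mkfs.xfs picks N from the device size and geometry, not from the CPU count: typically 4 on a single disk of a few TiB or less, more on larger or striped devices. Each AG is almost an independent mini-filesystem with its own free-space and inode bookkeeping. This lets you:

Have several processes write to different AGs at once without contending for the same locks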
Keep allocation decisions local (nearby inodes lead to nearby blocks)
Run a fast mkfs (you format the AGs in parallel)

```bash
$ mkfs.xfs /dev/sdb1
meta-data=/dev/sdb1   isize=512   agcount=4, agsize=...
```

You can override it:

```bash
mkfs.xfs -d agcount=16 /dev/sdb1
```

More AGs means more parallelism, but also more overhead. The default is usually fine.
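To see what mkfs.xfs actually chose, you can read the geometry back from xfs_info. The helper below is a minimal sketch: `ag_summary` is a hypothetical name, and it assumes the usual `agcount=`, `agsize=` and `bsize=` fields of xfs_info output arrive on stdin.

```bash
# Summarise allocation-group geometry from xfs_info-style text on stdin.
# ag_summary is a hypothetical helper, not part of xfsprogs.
ag_summary() {
  fields=$(awk '
    match($0, /agcount=[0-9]+/) { c = substr($0, RSTART + 8, RLENGTH - 8) }
    match($0, /agsize=[0-9]+/)  { s = substr($0, RSTART + 7, RLENGTH - 7) }
    # bsize= also appears on the naming and log lines; take the data line only
    /^data/ && match($0, /bsize=[0-9]+/) { b = substr($0, RSTART + 6, RLENGTH - 6) }
    END { print c, s, b }')
  set -- $fields
  # agsize is counted in filesystem blocks, so multiply by the block size
  echo "agcount=$1 ag_bytes=$(( $2 * $3 )) total_bytes=$(( $1 * $2 * $3 ))"
}
```

Typical use would be `xfs_info /mnt/data | ag_summary`. An AG cannot exceed 1 TiB, so on large devices the count rises above 4 on its own.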

Inodes are dynamic

Unlike ext4, where the inode count is fixed at mkfs time, XFS allocates inodes as needed, so you will not normally run out of inodes while gigabytes are still free. (It can still happen in corner cases, for example when free space is too fragmented to hold a new inode chunk.)

The downside: inodes are scattered across the filesystem, so a directory walk is slower on an HDD. On an SSD this matters far less.

The inode size is 512 bytes by default (it used to be 256 bytes). A larger inode leaves room to store extended attributes and ACLs inline, without allocating separate blocks.

```bash
mkfs.xfs -i size=1024 /dev/sdb1   # even more room for xattr
```

Journaling

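XFS journals metadata only. File data is not journaled: after a crash the metadata is consistent, but data that had not yet been flushed to disk (for example via fsync) can be lost. By default the journal (the log) lives inside the filesystem; you can move it to a separate device: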

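```bash
# Create the filesystem with a 128 MiB external log on the NVMe partition
mkfs.xfs -l logdev=/dev/nvme0n1p1,size=128m /dev/sdb1
# An external log must be named again at every mount
mount -o logdev=/dev/nvme0n1p1 /dev/sdb1 /mnt/data
```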

Putting the log on NVMe usually speeds up a heavy metadata load (creating and deleting millions of files).
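To confirm where the log ended up and how large it is, the `log` line of xfs_info can be parsed. This is a rough sketch: `log_summary` is a hypothetical helper, and the exact wording of the `log` line (for example `=internal`) is assumed from typical xfs_info output.

```bash
# Report where the log lives and how large it is, from xfs_info-style text on stdin.
# log_summary is a hypothetical helper, not part of xfsprogs.
log_summary() {
  fields=$(awk '
    /^log/ {
      loc = $2; sub(/^=/, "", loc)   # e.g. "internal"
      if (match($0, /bsize=[0-9]+/))  b = substr($0, RSTART + 6, RLENGTH - 6)
      if (match($0, /blocks=[0-9]+/)) n = substr($0, RSTART + 7, RLENGTH - 7)
    }
    END { print loc, b, n }')
  set -- $fields
  # log size in bytes = log blocks * log block size
  echo "log=$1 bytes=$(( $2 * $3 ))"
}
```

Typical use would be `xfs_info /mnt/data | log_summary`.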

Mount options

```fstab
UUID=...  /data  xfs  defaults,noatime,nodiratime,inode64,logbsize=256k  0 0
```

| Option | What it does |
|---|---|
| `noatime` | does not update access times on reads, as on other filesystems |
| `inode64` (default since RHEL 7) | inodes can live in any AG; without it they sit only in the lowest 1 TiB |
| `logbsize=256k` | in-memory log buffer size; larger is faster on metadata-heavy loads |
| `largeio` | reports the optimal I/O size in `st_blksize` as a hint to applications |
| `nobarrier` | disables write barriers; faster but dangerous without a battery-backed cache; removed from XFS in kernel 4.19 |
| `pquota`, `uquota`, `gquota` | quotas by project/user/group |

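Barriers (on by default; the XFS options are spelled barrier and nobarrier, not ext4's barrier=1) guarantee that log writes reach stable storage before the metadata they describe is written back. On a HW RAID with a battery-backed cache you could run without them (nobarrier), but then you own the correctness of the storage stack. Kernels from 4.19 on no longer accept barrier or nobarrier for XFS: cache flushes are always issued.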
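Because XFS dropped the barrier and nobarrier options in kernel 4.19, an old fstab entry that still carries one can fail to mount after an upgrade. The following is a small sketch for spotting such entries; `find_barrier_opts` is a hypothetical name and it only inspects text in fstab format.

```bash
# List XFS fstab entries that still use barrier/nobarrier.
# find_barrier_opts is a hypothetical helper; it reads fstab-format lines on stdin.
find_barrier_opts() {
  awk '$1 !~ /^#/ && $3 == "xfs" {
         n = split($4, opts, ",")          # field 4 holds the mount options
         for (i = 1; i <= n; i++)
           if (opts[i] == "barrier" || opts[i] == "nobarrier")
             print $2 ": " opts[i]         # mount point and the offending option
       }'
}
```

Typical use would be `find_barrier_opts < /etc/fstab`; no output means nothing to fix.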

Online grow

```bash
# First resize the partition or LV
parted /dev/sdb resizepart 1 100%
# or
lvextend -L+100G /dev/vg/lv-data

# Then XFS picks it up
xfs_growfs /mnt/data
```

xfs_growfs takes a mountpoint, not a device. The filesystem must be mounted.

Shrink is impossible. If you need to make it smaller: back up, mkfs, restore. This is the sorest spot of XFS.
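Before calling xfs_growfs it can help to check that the underlying device really is larger than the filesystem. The arithmetic is sketched below; `grow_headroom` is a hypothetical helper, and its inputs (device size in bytes, filesystem size in blocks, block size) are assumed to come from `blockdev --getsize64` and the `data` line of xfs_info.

```bash
# Print how many filesystem blocks the device has beyond the current filesystem.
# grow_headroom is a hypothetical helper: args are device bytes, fs blocks, block size.
grow_headroom() {
  dev_bytes=$1
  fs_blocks=$2
  bsize=$3
  # blocks the device can hold, minus blocks the filesystem already spans
  echo $(( dev_bytes / bsize - fs_blocks ))
}
```

For example, `grow_headroom "$(blockdev --getsize64 /dev/vg/lv-data)" 26214400 4096` prints a positive number when there is room to grow. A result of 0 means the partition or LV has not been extended yet.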

xfs_info, xfs_db

```bash
xfs_info /mnt/data                    # structure: AG count, block size, log
xfs_db -r /dev/sdb1 -c "version"      # internal details (-r opens read-only, which is safe)
xfs_io -r -c "stat" /path/to/file     # inode number, size, flags, extent count
xfs_io -r -c "fiemap" file            # extent map
xfs_bmap -v file                      # alternative to fiemap
```

xfs_db is a low-level debugger; in RW mode you can break the filesystem by accident.
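The verbose output of xfs_bmap includes an AG column, which makes it possible to see how a file's extents are spread across allocation groups. This is a minimal sketch; `extents_per_ag` is a hypothetical helper, and the column layout (extent number, file offset, block range, AG) is assumed from typical `xfs_bmap -v` output.

```bash
# Count a file's extents per allocation group from `xfs_bmap -v` output on stdin.
# extents_per_ag is a hypothetical helper, not part of xfsprogs.
extents_per_ag() {
  # Extent rows start with "N:"; holes and delalloc rows have no numeric block range.
  awk '$1 ~ /^[0-9]+:$/ && $3 ~ /^[0-9]+\.\.[0-9]+$/ { count[$4]++ }
       END { for (ag in count) print "ag=" ag " extents=" count[ag] }' |
    sort -t= -k2,2n
}
```

Typical use would be `xfs_bmap -v file | extents_per_ag`. A large file confined to one AG is normal; many small extents scattered over many AGs hints at fragmentation.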

xfs_repair vs e2fsck

XFS runs a journal replay automatically at mount. If that is not enough (corruption after a controller crash, bad sectors):

```bash
umount /mnt/data
xfs_repair /dev/sdb1                  # must be unmounted
xfs_repair -L /dev/sdb1               # force, zeroing the journal (RISK!)
```

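-L is the last resort: it zeroes the log, losing whatever metadata changes had not yet been written back. If the log is dirty, a plain xfs_repair refuses to run and asks you to mount the filesystem so the log can be replayed; reach for -L only when that mount fails.

Unlike e2fsck, xfs_repair is not something you run routinely on a healthy filesystem (fsck.xfs at boot is a no-op). You either need it or you do not.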
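When you are unsure whether a repair is needed at all, xfs_repair has a no-modify mode (-n) that only reports problems. A small wrapper could look like the sketch below; `check_xfs` and its optional second argument are hypothetical, added so the logic can be tried without a real device.

```bash
# Dry-run health check: xfs_repair -n inspects an unmounted device without modifying it.
# check_xfs is a hypothetical wrapper, not part of xfsprogs.
check_xfs() {
  dev=$1
  checker=${2:-xfs_repair}   # overridable so the logic can be exercised without a device
  if "$checker" -n "$dev" >/dev/null 2>&1; then
    echo "clean"
  else
    echo "needs attention"   # corruption reported, or the check could not run
  fi
}
```

Typical use would be `check_xfs /dev/sdb1` on an unmounted filesystem; only a "needs attention" result justifies a real xfs_repair run.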

Quota

XFS quota has three dimensions:

uquota: per user
gquota: per group
pquota: per project (a group of inodes tagged with one id)

pquota is the distinctive one (it originated in XFS; ext4 gained project quotas much later): you can put a quota on an arbitrary directory subtree that is not tied to a user or group:

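```bash
# Mount with project quota accounting enabled
mount -o pquota /dev/sdb1 /data
# Map project id 42 to a directory tree, and give the id a name
echo '42:/data/projects/foo' >> /etc/projects
echo 'foo:42' >> /etc/projid
# Tag the tree with the project id, then set a 10 GiB hard block limit
xfs_quota -x -c 'project -s foo' /data
xfs_quota -x -c 'limit -p bhard=10g foo' /data
```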
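The same sequence can be generated for any project. The helper below is a sketch that only prints the config lines and commands rather than running them; `project_quota_plan` and its argument order are hypothetical, and the final `report -p -h` line is included as a way to inspect the result.

```bash
# Print, without executing anything, the config lines and commands for one project quota.
# project_quota_plan is a hypothetical helper: args are name, id, directory, mount point, limit.
project_quota_plan() {
  name=$1
  id=$2
  dir=$3
  mnt=$4
  limit=$5
  echo "/etc/projects: $id:$dir"
  echo "/etc/projid:   $name:$id"
  echo "xfs_quota -x -c 'project -s $name' $mnt"
  echo "xfs_quota -x -c 'limit -p bhard=$limit $name' $mnt"
  echo "xfs_quota -x -c 'report -p -h' $mnt"
}
```

For example, `project_quota_plan foo 42 /data/projects/foo /data 10g` reproduces the steps above and ends with the command that reports per-project usage.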

XFS vs ext4

| Trait | XFS | ext4 |
|---|---|---|
| Inodes | dynamic | fixed at mkfs |
| Parallel I/O | strong (AG) | weaker |
| Huge filesystems (>16 TiB) | good | worse due to structure |
| Small files | good | good (denser packing) |
| Online resize | grow only | grow + shrink |
| Crash recovery | journal replay | journal replay + extensive fsck |
| RAM for metadata | more | less |
| Default on | RHEL 7+, CentOS, Oracle Linux | Debian/Ubuntu/Mint |

For a root filesystem on a host with no special requirements, both are fine. For a data partition with databases, VMs, or parallel load, pick XFS.

When something goes wrong

xfs_growfs: data size unchanged: the partition or LV is not resized yet. Run parted resizepart or lvextend first, then xfs_growfs.
Structure needs cleaning at mount after a crash: unmount and run xfs_repair. If it refuses because of a dirty log and the filesystem cannot be mounted to replay it, use -L (knowing the risk).
Fragmentation: XFS usually does not fragment on its own, but it starts to on a near-full filesystem. Measure it with xfs_db -r -c frag /dev/sdb1 (the -r keeps it read-only). xfs_fsr is the online defragmenter.
mkfs on NVMe balks about block size 4K: old xfsprogs could not handle it; update xfsprogs >= 5.0.
Quota does not count a project: you forgot mount -o pquota (prjquota is an accepted synonym). Quota options only take effect on a fresh mount, not on a remount.
Cannot allocate memory at mount after a crash: a very large journal; you need more RAM or you have to move the log to a separate disk.
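For the fragmentation case, the percentage that xfs_db prints can be reproduced from its "actual" and "ideal" extent counts. The sketch below assumes the factor is computed as (actual - ideal) / actual; `frag_factor` is a hypothetical helper.

```bash
# Fragmentation factor in percent from actual and ideal extent counts.
# frag_factor is a hypothetical helper; the formula is an assumption about xfs_db's output.
frag_factor() {
  awk -v actual="$1" -v ideal="$2" 'BEGIN {
    if (actual <= 0) { print "0.00"; exit }   # avoid dividing by zero on an empty filesystem
    printf "%.2f\n", (actual - ideal) * 100 / actual
  }'
}
```

The number climbs quickly even when files average only a couple of extents each, so treat it as a hint and look at extents per file before scheduling xfs_fsr.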

