Why XFS
XFS was born at SGI in 1993 for IRIX and large storage. The main differences from ext4:
- Allocation groups: the filesystem is split into N independent regions, and different processes write to different AGs in parallel
- Extent-based allocation: a contiguous run of blocks is described by a single extent (start, length) rather than a list of blocks
- Dynamic inode allocation: inodes are not fixed at mkfs time
- Delayed allocation: blocks are allocated at flush, not at write
- Metadata-only journaling: every metadata change is logged, file contents are not
- Online grow is available, shrink is not
It has been the default filesystem since RHEL 7 (2014). It fits databases (PostgreSQL, MySQL), file servers, kvm/qemu disks, and mail spools with large mboxes.
Allocation groups: the key to parallelism
The filesystem is divided into N equal regions. mkfs.xfs derives the default count from the device size and geometry, not from the CPU count; a small single disk typically gets 4. Each AG is almost an independent mini-filesystem: its own free blocks, its own inodes. This lets you:
- Have several processes write to different AGs at once without contending for the same locks
- Keep allocation decisions local (nearby inodes lead to nearby blocks)
- Run a fast mkfs (only a small set of headers per AG has to be written)
$ mkfs.xfs /dev/sdb1
meta-data=/dev/sdb1 isize=512 agcount=4, agsize=...
You can override it:
mkfs.xfs -d agcount=16 /dev/sdb1
More AGs means more parallelism, but also more overhead. The default is usually fine.
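To see what mkfs.xfs actually chose, the geometry can be read back from xfs_info. This is a minimal sketch, assuming the usual xfs_info layout (agcount= and agsize= in blocks on the meta-data line, bsize= on the data line). The helper name ag_summary is made up for this note.

```shell
# ag_summary: hypothetical helper; reads xfs_info-style text on stdin and
# prints the AG count and the size of one AG in MiB.
ag_summary() {
  awk '
    /agcount=/ {
      match($0, /agcount=[0-9]+/); agcount = substr($0, RSTART + 8, RLENGTH - 8)
      match($0, /agsize=[0-9]+/);  agsize  = substr($0, RSTART + 7, RLENGTH - 7)
    }
    /^data/ {
      match($0, /bsize=[0-9]+/);   bsize   = substr($0, RSTART + 6, RLENGTH - 6)
    }
    END { printf "agcount=%d ag_mib=%d\n", agcount, agsize * bsize / 1048576 }
  '
}

# Typical use on a mounted filesystem:
#   xfs_info /mnt/data | ag_summary
```

The function only parses text, so it can be tried on saved xfs_info output before pointing it at a live mount.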
Inodes are dynamic
Unlike ext4 (fixed at mkfs time), XFS allocates inodes as needed, so you rarely hit "no space left" while gigabytes are still free. It can still happen when free space is too fragmented to hold a new inode chunk.
The downsides: inodes are scattered across the filesystem, so a directory walk is slower on an HDD. On SSD this matters much less.
The inode size is 512B by default (it used to be 256B). A large inode is good for inline extended-attributes and ACLs without allocating separate blocks.
mkfs.xfs -i size=1024 /dev/sdb1 # even more room for xattr
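Dynamic allocation does not make inode usage invisible: df -i still reports it, and on XFS the total is an estimate that moves with free space. The sketch below assumes the GNU df -i column order (Filesystem, Inodes, IUsed, IFree, IUse%, Mounted on) and mountpoints without spaces. inode_pressure is a made-up name.

```shell
# inode_pressure: hypothetical helper; reads `df -i` output on stdin and
# prints every mount whose inode usage is at or above the threshold (percent).
inode_pressure() {
  threshold=${1:-80}
  awk -v t="$threshold" '
    NR > 1 {
      pct = $5; sub(/%/, "", pct)          # "12%" -> "12"; "-" counts as 0
      if (pct + 0 >= t) print $6, pct "%"
    }
  '
}

# Typical use:
#   df -i -t xfs | inode_pressure 80
```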
Journaling
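XFS journals metadata only. File data is not journaled: it stays consistent through ordered writes and cache flushes (barriers), and applications that need durability must call fsync. The journal lives inside the filesystem by default (you can move it to a separate device):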
mkfs.xfs -l logdev=/dev/nvme0n1p1,size=128m /dev/sdb1
mount -o logdev=/dev/nvme0n1p1 /dev/sdb1 /mnt/data
Putting the log on NVMe usually speeds up a heavy metadata load (creating and deleting millions of files).
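To confirm where the log of an existing filesystem lives, the log line of xfs_info is enough. A small sketch, assuming that line starts with "log" and carries either "internal" or a device path right after the first "="; log_location is a hypothetical name.

```shell
# log_location: hypothetical helper; reads xfs_info-style text on stdin and
# prints "internal" or the external log device.
log_location() {
  awk '
    /^log/ {
      split($0, parts, "=")            # parts[2] starts right after the first "="
      split(parts[2], words, " ")      # first word: "internal" or a device path
      print words[1]
    }
  '
}

# Typical use:
#   xfs_info /mnt/data | log_location
```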
Mount options
UUID=... /data xfs defaults,noatime,inode64,logbsize=256k 0 0
| Option | What it does |
|---|---|
| noatime | as everywhere, turns off atime |
| inode64 (default since RHEL7) | inodes can live in any AG; without it they sit only in the lowest 1 TiB |
| logbsize=256k | log buffer size; larger is faster on metadata-heavy loads |
| largeio | reports the optimal I/O size in st_blksize as a hint to applications |
| nobarrier | disables barriers, faster but dangerous on non-battery RAID |
| pquota, uquota, gquota | quotas by project/user/group |
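Barriers (the barrier option, on by default) guarantee that journal
writes reach stable storage before the metadata they describe is
written back. On a HW RAID with a battery-backed cache you could
run without them (set nobarrier), but then you own the correctness
of the storage stack. Kernels from 4.19 on no longer accept
barrier/nobarrier on XFS; cache flushes are always issued.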
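What is actually in effect is listed in /proc/mounts, which can differ from fstab. The following sketch flags the options discussed here; it assumes the standard /proc/mounts field order (device, mountpoint, type, options). The helper name and its messages are invented for illustration.

```shell
# xfs_mount_audit: hypothetical helper; reads /proc/mounts-style lines on
# stdin and reports XFS mounts with options worth a second look.
xfs_mount_audit() {
  awk '
    $3 == "xfs" {
      opts = "," $4 ","                  # pad so every option is comma-delimited
      if (index(opts, ",noatime,") == 0)  print $2 ": atime updates enabled"
      if (index(opts, ",nobarrier,") > 0) print $2 ": nobarrier set"
      if (index(opts, ",inode32,") > 0)   print $2 ": inode32 set"
    }
  '
}

# Typical use:
#   xfs_mount_audit < /proc/mounts
```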
Online grow
# First resize the partition or LV
parted /dev/sdb resizepart 1 100%
# or
lvextend -L+100G /dev/vg/lv-data
# Then XFS picks it up
xfs_growfs /mnt/data
xfs_growfs takes a mountpoint, not a device. The filesystem must
be mounted.
Shrink is impossible. If you need to make it smaller: back up, mkfs, restore. This is the sorest spot of XFS.
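Before running xfs_growfs it helps to check that the device really is larger than the filesystem. A rough sketch of that arithmetic follows; grow_headroom is a hypothetical helper, the numbers in the usage comment are illustrative, and the result is an upper bound rather than a promise of what xfs_growfs will claim.

```shell
# grow_headroom: hypothetical helper. Arguments: filesystem size in blocks,
# block size in bytes, block device size in bytes. Prints the bytes by which
# the device exceeds the filesystem (0 means the device was not enlarged).
grow_headroom() {
  fs_bytes=$(( $1 * $2 ))
  if [ "$3" -gt "$fs_bytes" ]; then
    echo $(( $3 - fs_bytes ))
  else
    echo 0
  fi
}

# Where the inputs come from:
#   xfs_info /mnt/data               # data line: bsize=..., blocks=...
#   blockdev --getsize64 /dev/sdb1   # device size in bytes
#   grow_headroom 26214400 4096 214748364800
```

A result of 0 means the partition or LV still has to be resized first.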
xfs_info, xfs_db
xfs_info /mnt/data # structure: AG count, block size, log
xfs_db -r /dev/sdb1 -c "version" # internal details (read-only is safe)
xfs_io -r -c "stat" /path/to/file # inode number, size, extent count, project id
xfs_io -c "fiemap" file # extent map
xfs_bmap -v file # alternative to fiemap
xfs_db is a low-level debugger; in RW mode you can break the
filesystem by accident.
xfs_repair vs e2fsck
XFS runs a journal replay automatically at mount. If that is not enough (corruption after a controller crash, bad sectors):
umount /mnt/data
xfs_repair /dev/sdb1 # must be unmounted
xfs_repair -L /dev/sdb1 # force, zeroing the journal (RISK!)
-L is the last resort: it zeroes the log, losing whatever did
not finish committing. Try a plain mount first so the log can
replay; use -L only when that mount fails and xfs_repair refuses
to run because the log is dirty or corrupt.
Unlike [[ext4|e2fsck]], xfs_repair does not patch small things on a healthy filesystem. You either need it or you do not.
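The escalation order can be rehearsed as a script. This is a sketch, not a recovery tool: xfs_recover and run are made-up names, DRY_RUN is an assumed convention of this example, and -L is deliberately left out. It relies on xfs_repair -n changing nothing and exiting non-zero when it finds corruption.

```shell
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# xfs_recover: hypothetical wrapper around the escalation order.
# Arguments: block device, mountpoint.
xfs_recover() {
  dev=$1
  mnt=$2
  # 1. A plain mount replays the log; unmount again so repair can run.
  if run mount "$dev" "$mnt"; then
    run umount "$mnt"
  fi
  # 2. Check only: -n modifies nothing and exits non-zero on corruption.
  if run xfs_repair -n "$dev"; then
    echo "no corruption reported"
    return 0
  fi
  # 3. Real repair. -L is deliberately left out; add it by hand if needed.
  run xfs_repair "$dev"
}

# Rehearse without touching anything:
#   DRY_RUN=1 xfs_recover /dev/sdb1 /mnt/data
```

With DRY_RUN=1 every step "succeeds", so the rehearsal stops after the check step; on a real device the exit status of xfs_repair -n decides whether step 3 runs.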
Quota
XFS quota has three dimensions:
- uquota: per user
- gquota: per group
- pquota: per project (a group of inodes tagged with one id)
pquota originated in XFS (ext4 gained project quota only much later): you can put a quota on an arbitrary subtree
of directories that is not tied to a user or group:
mount -o pquota /dev/sdb1 /data
echo '42:/data/projects/foo' >> /etc/projects
echo 'foo:42' >> /etc/projid
xfs_quota -x -c 'project -s foo' /data
xfs_quota -x -c 'limit -p bhard=10g foo' /data
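Name-based commands such as project -s foo need the two files to agree. The sketch below lists project ids from a projects file ("id:path") that have no name in a projid file ("name:id"); it assumes both files are non-empty. projects_without_name is a hypothetical helper.

```shell
# projects_without_name: hypothetical helper.
# Arguments: projid file ("name:id"), projects file ("id:path").
# Prints each project id that has a directory but no name.
projects_without_name() {
  awk -F: '
    /^#/ || NF < 2 { next }               # skip comments and malformed lines
    NR == FNR { named[$2] = 1; next }     # first file: remember named ids
    !($1 in named) { print $1 }           # second file: report unnamed ids
  ' "$1" "$2"
}

# Typical use:
#   projects_without_name /etc/projid /etc/projects
#   xfs_quota -x -c 'report -p -h' /data   # then check usage per project
```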
XFS vs ext4
| Trait | XFS | ext4 |
|---|---|---|
| Inodes | dynamic | fixed at mkfs |
| Parallel I/O | strong (AG) | weaker |
| Huge filesystems (>16 TiB) | good | worse due to structure |
| Small files | good | good (denser packing) |
| Online resize | grow only | grow + shrink |
| Crash recovery | journal replay | journal replay + extensive fsck |
| RAM for metadata | more | less |
| Default on | RHEL 7+, CentOS, Oracle Linux | Debian/Ubuntu/Mint |
For a root filesystem on a host with no special requirements, both are fine. For a data partition with databases, VMs, or parallel load, pick XFS.
When something goes wrong
- "xfs_growfs: data size unchanged": the partition or LV is not resized yet. Run parted resizepart or lvextend first, then xfs_growfs.
- "Structure needs cleaning" at mount after a crash: run xfs_repair. If it complains about the log, use -L (knowing the risk).
- Fragmentation: XFS usually does not fragment on its own, but it starts to on a near-full filesystem. Measure it with xfs_db -r -c frag /dev/sdb1 (read-only mode only). xfs_fsr is the online defrag.
- mkfs on NVMe balks about block size 4K: old xfsprogs could not handle it; update to xfsprogs >= 5.0.
- Quota does not count a project: you forgot mount -o pquota (prjquota is an accepted synonym for the same option).
- "Cannot allocate memory" at mount after a crash: a very large journal; you need more RAM or you have to move the log to a separate disk.
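For the fragmentation check, the summary line of the xfs_db frag command can be reduced to its three numbers. This sketch assumes the line has the form "actual N, ideal M, fragmentation factor P%"; other xfsprogs versions may print extra lines, which are ignored. frag_summary is a made-up name.

```shell
# frag_summary: hypothetical helper; reads the output of `xfs_db -r -c frag`
# on stdin and prints the three numbers from its summary line.
frag_summary() {
  sed -n 's/^actual \([0-9][0-9]*\), ideal \([0-9][0-9]*\), fragmentation factor \([0-9.][0-9.]*\)%.*/actual=\1 ideal=\2 factor=\3/p'
}

# Typical use:
#   xfs_db -r -c frag /dev/sdb1 | frag_summary
```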