Why it exists
Disk is slow, RAM is fast. The page cache keeps file blocks in memory so that:
- A repeated read of the same file does not touch the disk
- A write returns as soon as the data is copied into memory (writeback happens later)
- mmap-ed files work without an explicit read()
This is transparent: programs read and write ordinary files, and the kernel
caches on its own. Managing it is a large part of the Linux memory-management subsystem.
free -h: what the columns mean
               total        used        free      shared  buff/cache   available
Mem:            16Gi       4.0Gi       500Mi       100Mi        12Gi        11Gi
Swap:            4Gi           0         4Gi
- used is what processes and the kernel actually hold
- free is used by nobody (usually a small amount!)
- buff/cache is the page cache (plus reclaimable kernel slab); most of it is handed to processes on demand
- available is an estimate of what a new process would really get (roughly free + reclaimable cache)
"Linux ate my RAM" is a myth. When available is large, everything is fine:
clean cache pages are reclaimed as soon as something needs the memory. Watch available, not free.
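To see the gap between free and available without reading the whole free table, the two values can be pulled straight from /proc/meminfo. This is a minimal sketch: mem_headroom is a hypothetical helper name, and the optional file argument exists only so the function can be tried against a saved copy of meminfo.

```shell
# Hypothetical helper: print MemFree and MemAvailable in MiB.
# /proc/meminfo reports both values in kB.
mem_headroom() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemFree:"      { free = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            printf "free=%d MiB available=%d MiB\n", free / 1024, avail / 1024
        }
    ' "$meminfo"
}
```

Calling mem_headroom with no argument reads the live /proc/meminfo. A small free next to a large available is the normal, healthy state.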
Buffers vs Cache
Historically:
- Buffers is a block-level cache (raw device blocks)
- Cache is a filesystem-level cache (files)
On modern kernels the difference has almost vanished, and both land in the
page cache. In free they are summed into buff/cache.
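The summed column can be reproduced from /proc/meminfo. The sketch below assumes the procps-ng definition documented in free(1), where buff/cache is Buffers + Cached + SReclaimable. buff_cache_kib is a hypothetical helper name.

```shell
# Hypothetical helper: recompute the buff/cache column of free, in KiB.
# Assumes the procps-ng definition: Buffers + Cached + SReclaimable.
buff_cache_kib() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        # Exact field match, so SwapCached: is not counted as Cached:
        $1 == "Buffers:" || $1 == "Cached:" || $1 == "SReclaimable:" { sum += $2 }
        END { printf "%d\n", sum }
    ' "$meminfo"
}
```

On most systems the Buffers term is tiny compared to Cached, which matches the point that the distinction has largely vanished.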
Dirty pages and writeback
When a program writes to a file, the page is marked dirty. The kernel flushes it to disk asynchronously through writeback.
grep -E '^(Dirty|Writeback):' /proc/meminfo
# Dirty:             12288 kB   ← modified, not on disk yet
# Writeback:             0 kB   ← being written right now
It is controlled by sysctls:
cat /proc/sys/vm/dirty_ratio              # % of available memory at which writing processes are throttled and must flush (default 20)
cat /proc/sys/vm/dirty_background_ratio   # % at which background flushing starts (default 10)
cat /proc/sys/vm/dirty_expire_centisecs   # age at which a dirty page becomes due for flushing (default 3000 = 30 s)
A dirty_ratio that is too large lets dirty pages pile up, so an fsync or a forced flush can stall for a long time. One that is too small triggers flushing more often and means more I/O. Production databases often tune these values.
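To get a feel for what the percentages mean in absolute terms, they can be turned into sizes. This is a rough sketch only: the kernel applies the ratios to its own count of free plus reclaimable pages, and MemAvailable is used here as an approximation of that figure. dirty_thresholds_kib is a hypothetical helper name, and its default ratios are the defaults quoted above.

```shell
# Hypothetical helper: approximate the dirty-page thresholds in KiB.
# Usage: dirty_thresholds_kib AVAILABLE_KIB [BACKGROUND_RATIO] [RATIO]
dirty_thresholds_kib() {
    avail_kib=$1
    bg_ratio=${2:-10}   # default of vm.dirty_background_ratio
    ratio=${3:-20}      # default of vm.dirty_ratio
    echo "background=$(( avail_kib * bg_ratio / 100 )) hard=$(( avail_kib * ratio / 100 ))"
}

# Example with live values (not run here):
#   dirty_thresholds_kib "$(awk '$1 == "MemAvailable:" { print $2 }' /proc/meminfo)" \
#       "$(cat /proc/sys/vm/dirty_background_ratio)" "$(cat /proc/sys/vm/dirty_ratio)"
```

With 12 GiB available and the defaults, background flushing would start at roughly 1.2 GiB of dirty data and writers would be throttled at roughly 2.4 GiB, which is a lot to flush in one go on a slow disk.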
sync, fsync, sync()
sync # flush ALL dirty pages of all filesystems (good before a reboot)
Inside programs:
- fsync(fd) guarantees the file's data and metadata are on disk
- fdatasync(fd) is the same, but skips metadata that is not needed to read the data back (faster)
- the O_SYNC flag on open() makes every write synchronous (slow)
Databases and journaling systems call fsync on every commit, hence the importance of fast NVMe over HDD for transaction throughput.
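The same guarantee is reachable from a script: dd with conv=fsync calls fsync on its output file before exiting. The sketch below combines that with a write-to-temp-then-rename step so readers never see a half-written file. durable_write is a hypothetical helper name. A fully durable rename would also need the containing directory synced, which this sketch leaves out.

```shell
# Hypothetical helper: write stdin to DEST so the data is on disk before the
# file appears under its final name.
# Usage: some_command | durable_write DEST
durable_write() {
    dst=$1
    tmp="$dst.tmp.$$"
    # conv=fsync: dd calls fsync() on the output file before it exits.
    dd of="$tmp" conv=fsync 2>/dev/null || { rm -f "$tmp"; return 1; }
    # rename() within one filesystem is atomic: readers see old or new, never partial.
    mv -f "$tmp" "$dst"
}
```

Timing this helper on an HDD versus an NVMe drive shows directly why commit-heavy workloads are bound by fsync latency.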
drop_caches: free the cache by hand
sync # flush dirty pages first
echo 3 | sudo tee /proc/sys/vm/drop_caches # 1=pagecache 2=dentries+inodes 3=everything
Why: to benchmark cold-cache performance, to force a reload, for debugging. Do NOT do this in production: the first read of a large file afterward becomes slow.
Direct I/O: bypassing the cache
When a program manages caching itself (databases, video players):
open(path, O_DIRECT | O_RDWR, ...);
Reads and writes go straight to the device, past the page cache. The downside is that the buffer address, the transfer size, and the file offset must all be aligned, typically to the device's logical block size.
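A quick way to check whether a planned transfer satisfies the alignment rule is plain modular arithmetic. This is a sketch: is_direct_io_aligned is a hypothetical helper name, the 512-byte default is an assumption, and it checks only offset and length because a shell script has no control over buffer addresses.

```shell
# Hypothetical helper: succeed if OFFSET and LENGTH are multiples of BLOCK.
# Usage: is_direct_io_aligned OFFSET LENGTH [BLOCK]
is_direct_io_aligned() {
    offset=$1
    length=$2
    block=${3:-512}   # assumed default; query the real value with blockdev --getss
    [ $(( offset % block )) -eq 0 ] && [ $(( length % block )) -eq 0 ]
}

# Related commands (not run here, /dev/sda is a placeholder device):
#   blockdev --getss /dev/sda                                    # logical sector size
#   dd if=/dev/sda of=/dev/null bs=4096 count=1 iflag=direct    # one O_DIRECT read
```

A misaligned O_DIRECT request usually fails with EINVAL rather than falling back to cached I/O, so checking up front saves a confusing error.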
Readahead
When the kernel detects a sequential read of a file, it loads the following
blocks ahead of time. This is readahead:
blockdev --getra /dev/sda # readahead size in 512-byte sectors
blockdev --setra 16384 /dev/sda # increase it (for a streaming workload)
A large readahead helps sequential access and hurts random access (it reads what will not be needed).
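Because blockdev counts in 512-byte sectors, the numbers are easy to misread. A small conversion sketch follows. ra_sectors_to_kib is a hypothetical helper name.

```shell
# Hypothetical helper: convert a blockdev readahead value (512-byte sectors) to KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# Related (not run here, sda is a placeholder device):
#   cat /sys/block/sda/queue/read_ahead_kb   # the same setting, already in KiB
```

So ra_sectors_to_kib 16384 gives 8192: the --setra 16384 setting above asks for 8 MiB of readahead per sequential stream.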
fadvise / madvise
A program can give the kernel a hint:
- posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) means "forget this file from the cache" (worth calling after copying large files so they do not evict hot data)
- POSIX_FADV_SEQUENTIAL increases readahead
- POSIX_FADV_RANDOM disables readahead
The vmtouch tool evicts or warms the cache by hand for benchmarks.
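The DONTNEED hint is also reachable without writing C. The sketch below assumes GNU coreutils dd, whose nocache flag issues that advice for the file. Other dd implementations may not support the flag. drop_file_cache is a hypothetical helper name.

```shell
# Hypothetical helper: ask the kernel to drop one file's pages from the cache.
# Assumes GNU coreutils dd; "nocache" maps to posix_fadvise(POSIX_FADV_DONTNEED).
drop_file_cache() {
    # count=0 copies nothing: dd opens the file, advises the kernel, and exits.
    dd if="$1" iflag=nocache count=0 2>/dev/null
}
```

This is only a hint, and dirty pages are not dropped until they have been written back, so it is most effective on files that were only read or have already been synced.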
Page cache in cgroups v2
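In a container the page cache is charged to the cgroup, and the memory.max
limit includes it. When a container with a small limit reads a large file,
the kernel first reclaims that cgroup's own clean cache. The container can
still be OOM-killed, even without heap allocations, if the cached pages
cannot be reclaimed (dirty pages or tmpfs data, for example).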
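To see how much of a cgroup's charge is page cache rather than process memory, memory.stat splits the total into anon and file. The sketch below assumes a cgroup v2 directory containing memory.stat and memory.max. cgroup_cache_report is a hypothetical helper name, and the path in the usage line is a placeholder.

```shell
# Hypothetical helper: show anonymous memory, page cache, and the limit (bytes)
# for one cgroup v2 directory.
# Usage: cgroup_cache_report /sys/fs/cgroup/<group>
cgroup_cache_report() {
    cg=$1
    # Exact key match, so file_dirty, file_mapped, etc. are not picked up.
    anon_bytes=$(awk '$1 == "anon" { print $2 }' "$cg/memory.stat")
    file_bytes=$(awk '$1 == "file" { print $2 }' "$cg/memory.stat")
    limit=$(cat "$cg/memory.max")   # a byte count, or the literal string "max"
    echo "anon=$anon_bytes file=$file_bytes max=$limit"
}
```

If file accounts for most of the usage and sits close to max, the container is living on cache that the kernel has to keep reclaiming, which shows up as extra I/O before it shows up as an OOM kill.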