Page cache: disk in memory

Why it exists

Disk is slow, RAM is fast. The page cache keeps file blocks in memory so that:

A repeated read of the same file does not touch the disk
A write returns as soon as the data is copied into the cache (writeback to disk happens later)
mmap-ed files work without an explicit read()

This is transparent: programs write to ordinary files, and the kernel caches on its own. Managing it is a large part of the linux-vm subsystem.

free -h: what the columns mean

              total   used   free   shared  buff/cache  available
Mem:           16Gi  4.0Gi  500Mi    100Mi        12Gi       11Gi
Swap:           4Gi      0    4Gi

used is what processes and the kernel actually hold
free is used by nobody (usually a small amount!)
buff/cache is the page cache plus reclaimable kernel slab; most of it is given back to processes on demand
available is an estimate of what a new process would really get (roughly free + reclaimable cache)
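
The available column is taken from the MemAvailable field of /proc/meminfo, so a script can read it directly. A minimal sketch, not from the original notes: the helper name meminfo_mib and its optional second argument (a file to read instead of /proc/meminfo) are hypothetical.

```shell
#!/bin/sh
# meminfo_mib FIELD [FILE]
# Print one /proc/meminfo field converted to MiB.
# /proc/meminfo labels its values "kB", but the unit is 1024 bytes.
# FILE defaults to /proc/meminfo; the override (an assumption of this
# sketch) only exists so the function can be tried on a saved copy.
meminfo_mib() {
    awk -v key="$1:" '$1 == key { printf "%d\n", $2 / 1024 }' "${2:-/proc/meminfo}"
}
```

For example, `meminfo_mib MemFree` and `meminfo_mib MemAvailable` print the two numbers worth comparing when memory looks "full".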

"Linux ate my RAM" is a myth. When available is large, everything is fine: the cache is handed back for allocation instantly. Watch available, not free.

Buffers vs Cache

Historically:

Buffers is a block-level cache (raw device blocks)
Cache is a filesystem-level cache (files)

On modern kernels the difference has almost vanished, and both land in the page cache. In free they are summed into buff/cache.
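
The summed column can be reproduced from /proc/meminfo. The sketch below assumes the procps-ng definition, where buff/cache is Buffers + Cached + SReclaimable; the function name buff_cache_kib and its optional file argument are hypothetical.

```shell
#!/bin/sh
# buff_cache_kib [FILE]
# Sum the fields that free (procps-ng, as assumed here) shows as buff/cache:
#   Buffers      - cache of raw block-device data
#   Cached       - page cache of regular files
#   SReclaimable - kernel slab memory that can be reclaimed
# The exact match on "Cached:" deliberately skips "SwapCached:".
buff_cache_kib() {
    awk '$1 == "Buffers:" || $1 == "Cached:" || $1 == "SReclaimable:" { sum += $2 }
         END { printf "%d\n", sum }' "${1:-/proc/meminfo}"
}
```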

Dirty pages and writeback

When a program writes to a file, the page is marked dirty. The kernel flushes it to disk asynchronously through writeback.

bash

grep -E '^(Dirty|Writeback):' /proc/meminfo
# Dirty:             12288 kB    ← modified, not on disk yet
# Writeback:             0 kB    ← being written right now

Writeback is tuned through sysctls:

bash

cat /proc/sys/vm/dirty_ratio              # % of available memory at which writers are blocked until pages are flushed (default 20)
cat /proc/sys/vm/dirty_background_ratio   # % at which background flushing starts (default 10)
cat /proc/sys/vm/dirty_expire_centisecs   # age after which a dirty page is flushed, in 1/100 s (default 3000 = 30 s)

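A dirty_ratio that is too large lets gigabytes of dirty data pile up, so a later flush or fsync stalls for a long time. One that is too small starts writeback more often and throttles writers earlier. Production databases often tune this.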
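
To see what the percentages mean in absolute terms, they can be multiplied out. This is a rough sketch only: the kernel applies the ratios to its own count of dirtyable memory, which is approximated here by MemAvailable (an assumption), and both function names are hypothetical.

```shell
#!/bin/sh
# pct_of_kib PERCENT TOTAL_KIB
# Integer PERCENT percent of TOTAL_KIB.
pct_of_kib() {
    echo $(( $2 * $1 / 100 ))
}

# dirty_thresholds [AVAILABLE_KIB]
# Print the approximate amount of dirty data at which background flushing
# starts and at which writers are blocked.
# If vm.dirty_bytes / vm.dirty_background_bytes are set instead, the ratio
# files read 0 and this estimate does not apply.
dirty_thresholds() {
    avail=${1:-$(awk '$1 == "MemAvailable:" { print $2 }' /proc/meminfo)}
    bg=$(cat /proc/sys/vm/dirty_background_ratio)
    hard=$(cat /proc/sys/vm/dirty_ratio)
    echo "background flush at: $(pct_of_kib "$bg" "$avail") KiB"
    echo "writers blocked at:  $(pct_of_kib "$hard" "$avail") KiB"
}
```

With 16 GiB available and the default ratio of 20, writers are blocked only after roughly 3.2 GiB of dirty data has accumulated, which is why large-memory hosts often lower the value.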

sync, fsync, sync()

bash

sync                            # flush ALL dirty pages of all filesystems (good before a reboot)

Inside programs:

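fsync(fd) returns only after the file's data and metadata have reached the storage device
fdatasync(fd) does the same but skips metadata that is not needed to read the data back, such as timestamps (faster)
the O_SYNC flag on open() makes every write() synchronous (slow)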

Databases and journaling systems call fsync on every commit, so fsync latency bounds transaction throughput. This is why fast NVMe matters so much more than HDD here.
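
The same commit discipline can be followed from a script. A minimal sketch, assuming GNU coreutils 8.24 or newer, where sync with a file argument calls fsync() on that file; the function name durable_write is hypothetical.

```shell
#!/bin/sh
# durable_write PATH
# Write stdin to PATH so that a crash leaves either the old or the new
# content, never a half-written file:
#   1. write into a temporary file in the same directory
#   2. fsync the temporary file
#   3. rename it over the target (atomic within one filesystem)
#   4. fsync the directory so the rename itself is on disk
# Note: the result keeps the 0600 mode that mktemp assigns.
durable_write() {
    target=$1
    dir=$(dirname "$target")
    tmp=$(mktemp "$dir/.tmp.XXXXXX") || return 1
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    sync "$tmp" || { rm -f "$tmp"; return 1; }
    mv -f "$tmp" "$target" || { rm -f "$tmp"; return 1; }
    sync "$dir"
}
```

Usage: `printf 'commit 1\n' | durable_write /var/lib/myapp/state` (the path is only an illustration).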

drop_caches: free the cache by hand

bash

sync                                       # flush dirty pages first: only clean pages can be dropped
echo 3 | sudo tee /proc/sys/vm/drop_caches # 1=pagecache 2=dentries+inodes 3=both

Why: to benchmark cold-cache performance, to force a reload, for debugging. Do NOT do this in production: every read afterward goes to disk until the cache warms up again.

Direct I/O: bypassing the cache

When a program manages caching itself (databases, video players):

open(path, O_DIRECT | O_RDWR, ...);

Reads and writes go straight to disk, past the page cache. The downside is that the buffer address, the transfer size, and the file offset must all be aligned, typically to the logical block size of the device (512 or 4096 bytes).
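
The alignment rule comes down to rounding every size up to a multiple of the block size. The sketch below shows that arithmetic and a direct read through GNU dd, whose iflag=direct opens the input with O_DIRECT. The function names are hypothetical, and the 4096-byte default is an assumption; the real logical block size is reported by `blockdev --getss` for the device.

```shell
#!/bin/sh
# align_up SIZE BLOCK
# Round SIZE up to the next multiple of BLOCK (BLOCK must be > 0).
align_up() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

# direct_read FILE [BLOCK]
# Read FILE once, bypassing the page cache. dd allocates a suitably
# aligned buffer itself; BLOCK must be a multiple of the logical block
# size. Filesystems without O_DIRECT support (tmpfs, for example) make
# dd fail with "Invalid argument".
direct_read() {
    dd if="$1" of=/dev/null bs="${2:-4096}" iflag=direct status=none
}
```

For example, a 1000-byte record on a device with 512-byte logical blocks has to be transferred as `align_up 1000 512` = 1024 bytes.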

Readahead

The kernel sees a sequential read of a file and loads the next blocks ahead of time, which is readahead:

bash

sudo blockdev --getra /dev/sda             # readahead size in 512-byte sectors
sudo blockdev --setra 16384 /dev/sda       # raise it to 16384 sectors = 8 MiB (for a streaming workload)

A large readahead helps sequential access and hurts random access (it reads what will not be needed).
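
The unit is easy to get wrong: blockdev counts 512-byte sectors, while the equivalent sysfs file /sys/block/sda/queue/read_ahead_kb counts KiB. A small conversion sketch (both function names are hypothetical):

```shell
#!/bin/sh
# ra_sectors_to_kib SECTORS
# Convert a blockdev --getra value (512-byte sectors) to KiB.
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# ra_kib_to_sectors KIB
# Convert a read_ahead_kb value (KiB) to the sector count blockdev expects.
ra_kib_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}
```

So `blockdev --setra 16384` corresponds to a read_ahead_kb of 8192, and a read_ahead_kb of 128 corresponds to 256 sectors.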

fadvise / madvise

A program can give the kernel a hint:

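posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED) means "forget this file from the cache" (a length of 0 means "to the end of the file"). Worth calling after copying large files so they do not evict hot data. Dirty pages are not dropped, so call fsync first.
POSIX_FADV_SEQUENTIAL increases readahead (Linux doubles the window)
POSIX_FADV_RANDOM disables readahead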

The vmtouch tool evicts or warms the cache by hand for benchmarks.
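
Without vmtouch, the same hint can be sent from the shell. A sketch assuming GNU coreutils: dd with iflag=nocache and count=0 issues the "drop cache" advice for the whole input file without copying anything, and sync with a file argument flushes it first. The function name evict_file is hypothetical.

```shell
#!/bin/sh
# evict_file FILE
# Ask the kernel to drop FILE's pages from the page cache.
# The advice only affects clean pages, so the file is flushed first.
# This is a hint: the kernel may keep pages that are still in use,
# for example pages mapped by another process.
evict_file() {
    sync "$1" &&
    dd if="$1" iflag=nocache count=0 status=none
}
```

The file contents are untouched; only the cached copy is released. Whether pages were really dropped can be checked with `vmtouch -v FILE`.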

Page cache in cgroups v2

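In a container the page cache is counted per-cgroup, and the memory.max limit includes it. When the cgroup hits the limit, the kernel first reclaims that cgroup's own cache, so a container with a small limit that reads a large file mostly pays for it with constant re-reading from disk. If the pages cannot be reclaimed fast enough (for example, because many of them are dirty), the container can be OOM-killed even without heap allocations.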
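
How much of a cgroup's usage is page cache can be read from its memory.stat file, where the file key is the cached file data in bytes and file_dirty is the dirty part. A minimal sketch: the function name cgroup_stat is hypothetical, and the default path assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup as seen from inside the container.

```shell
#!/bin/sh
# cgroup_stat KEY [FILE]
# Print the value of one key from a cgroup v2 memory.stat file.
# Each line of memory.stat has the form "key value", with sizes in bytes.
cgroup_stat() {
    awk -v key="$1" '$1 == key { print $2 }' "${2:-/sys/fs/cgroup/memory.stat}"
}
```

Comparing `cgroup_stat file` with the contents of memory.max shows how close the cache alone brings the container to its limit.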
