Question 1

How is the buffer cache built, and why have it if there is an OS file cache?

Accepted Answer

The buffer cache is an area of shared memory common to all backends, sized
by `shared_buffers` and divided into 8 KB slots (the default page size).
Every read and write of a table or index page goes through it: a backend
does not touch the file directly but asks the buffer manager for the page.
If the page is already there, that is a hit and no read is issued. The OS
page cache also exists and works a layer below, but it knows nothing about
PostgreSQL's pins, dirty flags, or page LSNs. The buffer cache does, so it
can enforce the write-ahead rule and never flush a page whose changes have
not yet been recorded in WAL.
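
The hit/miss path described above can be sketched as a toy model. This is
not PostgreSQL's buffer manager: `BufferPool`, `get_page`, and the
`fake_read` callback are hypothetical names for illustration, and eviction
is deliberately left out.

```python
PAGE_SIZE = 8192  # bytes per slot, matching the 8 KB page size above


class BufferPool:
    """Toy buffer manager: fixed number of slots keyed by (file, block)."""

    def __init__(self, n_slots, read_block):
        self.n_slots = n_slots
        self.read_block = read_block  # hypothetical callback that reads "disk"
        self.pages = {}               # (relfile, blockno) -> bytearray
        self.hits = 0
        self.misses = 0

    def get_page(self, relfile, blockno):
        key = (relfile, blockno)
        if key in self.pages:
            self.hits += 1            # hit: the disk is not touched
            return self.pages[key]
        if len(self.pages) >= self.n_slots:
            raise RuntimeError("pool full; eviction is out of scope here")
        self.misses += 1              # miss: read the page into a free slot
        page = bytearray(self.read_block(relfile, blockno))
        self.pages[key] = page
        return page


disk_reads = []


def fake_read(relfile, blockno):
    """Stand-in for a file read; records that the 'disk' was touched."""
    disk_reads.append((relfile, blockno))
    return bytes(PAGE_SIZE)


pool = BufferPool(n_slots=4, read_block=fake_read)
pool.get_page("orders", 0)  # miss, one disk read
pool.get_page("orders", 0)  # hit, served from the pool
print(pool.hits, pool.misses, len(disk_reads))  # prints: 1 1 1
```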

Question 2

How does PostgreSQL choose which page to evict from the buffer cache?

Accepted Answer

Instead of classic LRU it uses clock sweep. Each buffer has a usage
counter: every access increments it (up to a ceiling of 5), and a pointer
walks the buffers in a circle, decrementing the counter of each unpinned
buffer it passes. The first buffer it finds with a zero counter and no pin
becomes the victim. If the victim is dirty (changed since it was read), it
is written to disk first, but only after WAL has been flushed up to that
page's LSN (the write-ahead rule). Hot pages keep raising their counter and
survive a lap, while cold ones are evicted.
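
A minimal sketch of the sweep, assuming a ceiling of 5 on the usage counter
and ignoring dirty-page writeback and locking. The `Buffer` and
`ClockSweep` names are made up for this example.

```python
from dataclasses import dataclass

MAX_USAGE = 5  # ceiling on the usage counter (assumed for this sketch)


@dataclass
class Buffer:
    page: str
    usage: int = 0
    pins: int = 0


class ClockSweep:
    def __init__(self, buffers):
        self.buffers = buffers
        self.hand = 0  # the pointer that walks the circle

    def touch(self, i):
        """An access raises the counter, but never above the ceiling."""
        b = self.buffers[i]
        b.usage = min(b.usage + 1, MAX_USAGE)

    def pick_victim(self):
        # After MAX_USAGE + 1 laps every unpinned buffer must have hit zero,
        # so the loop is bounded even if all counters start at the ceiling.
        for _ in range(len(self.buffers) * (MAX_USAGE + 1)):
            i = self.hand
            self.hand = (self.hand + 1) % len(self.buffers)
            b = self.buffers[i]
            if b.pins > 0:
                continue          # pinned buffers are skipped untouched
            if b.usage == 0:
                return i          # cold and unpinned: this is the victim
            b.usage -= 1          # give it one more lap
        raise RuntimeError("no unpinned buffer available")


bufs = [Buffer("A"), Buffer("B"), Buffer("C")]
sweep = ClockSweep(bufs)
sweep.touch(0)
sweep.touch(0)        # A is the hottest page (usage 2)
sweep.touch(2)        # C was used once (usage 1)
bufs[1].pins = 1      # B is pinned and cannot be evicted
victim = sweep.pick_victim()
print(bufs[victim].page)  # prints: C
```

C reaches zero one step before A gets a second visit, so the page touched
less often is the one evicted, and the pinned page is never considered.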

Question 3

Why do you need WAL, and what is the write-ahead rule?

Accepted Answer

WAL (write-ahead log) is a sequential log of all page changes. The
rule is simple: the WAL record describing a change must reach disk before
the changed page itself does. So at commit it is enough to durably write
WAL (one sequential fsync), and the dirty data pages can be flushed lazily
later. If the server crashes, on startup it replays WAL from the last
checkpoint and restores all committed changes. That way one sequential
write gives you both durability (the D in ACID) and a fast commit with no
random writes across the whole table.
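
The rule can be sketched as a toy model. All names here (`Wal`, `Page`,
`modify`, `write_page`) are hypothetical, and the LSN is simply a record
index rather than a byte offset.

```python
class Wal:
    def __init__(self):
        self.records = []     # in-memory tail of the log
        self.flushed_lsn = 0  # everything at or below this LSN is durable
        self.fsyncs = 0

    def append(self, payload):
        lsn = len(self.records) + 1  # toy LSN: record index
        self.records.append((lsn, payload))
        return lsn

    def flush(self, upto_lsn):
        """Make the log durable up to upto_lsn; a no-op if already done."""
        if upto_lsn > self.flushed_lsn:
            self.flushed_lsn = upto_lsn
            self.fsyncs += 1


class Page:
    def __init__(self):
        self.lsn = 0        # LSN of the last WAL record for this page
        self.dirty = False


def modify(page, wal, payload):
    """Log the change first, then stamp the in-memory page with its LSN."""
    page.lsn = wal.append(payload)
    page.dirty = True


def write_page(page, wal, disk):
    wal.flush(page.lsn)     # write-ahead rule: WAL reaches disk first
    disk.append(page.lsn)   # only then does the data page go out
    page.dirty = False


wal, disk = Wal(), []
p1, p2 = Page(), Page()
modify(p1, wal, "update row on page 1")
modify(p2, wal, "update row on page 2")
wal.flush(wal.append("commit"))  # commit: one flush covers both changes
write_page(p1, wal, disk)        # lazy write later, no extra fsync needed

p3 = Page()
modify(p3, wal, "uncommitted change")
write_page(p3, wal, disk)        # evicted early: WAL is forced out first
print(wal.fsyncs, wal.flushed_lsn, disk)  # prints: 2 4 [1, 4]
```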

Question 4

What is an LSN, and how does crash recovery work?

Accepted Answer

An LSN (log sequence number) is a monotonically increasing address of a
position in WAL, essentially a byte offset into the WAL stream. Each page
stores in its header the LSN of the last WAL record applied to it. During
recovery the server starts at the redo point of the last checkpoint and
replays WAL forward: for each record it compares the record's LSN with the
page's LSN and applies only what the page has not yet seen (idempotency by
LSN). On reaching the end of the WAL, the database is in a consistent state
with all committed transactions. The same LSNs serve as positions for
streaming replication.
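
Replay by LSN can be illustrated with a small sketch. Pages are plain
dicts and WAL records are `(lsn, page_id, key, value)` tuples; both shapes
are assumptions made for this example, not PostgreSQL's record format.

```python
def redo(pages, wal_records, redo_start_lsn):
    """Replay WAL forward from redo_start_lsn; return how many were applied."""
    applied = 0
    for lsn, page_id, key, value in wal_records:
        if lsn < redo_start_lsn:
            continue  # before the checkpoint's redo point
        page = pages.setdefault(page_id, {"lsn": 0, "data": {}})
        if page["lsn"] >= lsn:
            continue  # the page already contains this change
        page["data"][key] = value
        page["lsn"] = lsn  # stamp the page so a second replay skips it
        applied += 1
    return applied


wal_records = [
    (10, "p1", "a", 1),
    (20, "p2", "b", 2),
    (30, "p1", "a", 3),
]
# State on disk at crash time: p1 was flushed after LSN 10, p2 never was.
pages = {"p1": {"lsn": 10, "data": {"a": 1}}}

first = redo(pages, wal_records, redo_start_lsn=10)
second = redo(pages, wal_records, redo_start_lsn=10)
print(first, second)  # prints: 2 0
```

Running the replay twice changes nothing the second time, which is the
property that makes a crash during recovery itself harmless.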

Question 5

What is a full-page image, and why does PostgreSQL write a whole page into WAL?

Accepted Answer

An 8 KB page is not written to disk atomically, because the OS and the
storage device usually write in smaller units. On a crash during the write
you can get a torn page, part old and part new. So that such a page can be
recovered, on the first change to a page after a checkpoint PostgreSQL
writes a full copy of it into WAL, a full-page image (FPI). Later changes
to that page are logged as ordinary deltas until the next checkpoint, after
which the first change is again logged as an FPI. This is governed by
`full_page_writes` (on by default). FPIs are the main reason WAL swells
right after a checkpoint and why frequent checkpoints increase the WAL
volume.
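
The "first change after a checkpoint" test can be expressed as a comparison
between the page's LSN and the checkpoint's redo pointer. The sketch below
assumes that comparison and uses made-up sizes; real FPIs can be smaller
than a full page, so the byte counts are only illustrative.

```python
PAGE_SIZE = 8192
DELTA_SIZE = 64  # assumed size of an ordinary change record


def log_change(page, redo_ptr, next_lsn, full_page_writes=True):
    """Return (record_kind, bytes_logged) and stamp the page with next_lsn."""
    # A page not modified since the checkpoint's redo pointer has no FPI
    # in the current WAL range yet, so this change must carry one.
    needs_fpi = full_page_writes and page["lsn"] <= redo_ptr
    page["lsn"] = next_lsn
    if needs_fpi:
        return "fpi", PAGE_SIZE + DELTA_SIZE
    return "delta", DELTA_SIZE


page = {"lsn": 50}
kinds = []
kinds.append(log_change(page, redo_ptr=100, next_lsn=110)[0])  # first touch
kinds.append(log_change(page, redo_ptr=100, next_lsn=120)[0])  # same cycle
kinds.append(log_change(page, redo_ptr=130, next_lsn=140)[0])  # new checkpoint
print(kinds)  # prints: ['fpi', 'delta', 'fpi']
```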

Question 6

What does a checkpoint do, and how does its tuning affect load?

Accepted Answer

A checkpoint flushes to disk all dirty buffers accumulated up to some LSN
and writes a mark into WAL: "everything up to this position is already in
the data files". This shortens the WAL that will have to be replayed
on recovery. It is triggered by time (`checkpoint_timeout`) or by WAL
volume (`max_wal_size`). To avoid a write spike, the flush is spread over
a fraction of the checkpoint interval set by
`checkpoint_completion_target`. Too-frequent checkpoints swell WAL through
FPIs and load the disk; too-rare ones lengthen recovery and pile up dirty
buffers. Tuning is a trade-off between recovery time and a smooth write
load.
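
A rough back-of-the-envelope calculation shows how the completion target
smooths the write rate. It assumes a time-triggered checkpoint and a fixed
number of dirty buffers; `checkpoint_write_rate` is a hypothetical helper,
and real pacing is more involved than this.

```python
def checkpoint_write_rate(dirty_buffers, checkpoint_timeout_s,
                          completion_target, page_size=8192):
    """Return (window in seconds, pages per second, MiB per second)."""
    window_s = checkpoint_timeout_s * completion_target
    pages_per_s = dirty_buffers / window_s
    mib_per_s = pages_per_s * page_size / (1024 * 1024)
    return window_s, pages_per_s, mib_per_s


# 100,000 dirty 8 KB buffers, checkpoint_timeout of 5 minutes.
smooth = checkpoint_write_rate(100_000, 300, 0.9)
bursty = checkpoint_write_rate(100_000, 300, 0.5)
print(f"target 0.9: {smooth[0]:.0f} s window, {smooth[2]:.2f} MiB/s")
print(f"target 0.5: {bursty[0]:.0f} s window, {bursty[2]:.2f} MiB/s")
```

The same amount of data is written in both cases; a higher target only
stretches it over a longer window and so lowers the sustained rate.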

Question 7

What WAL levels are there, and why raise them?

Accepted Answer

The `wal_level` setting controls how much information is written to WAL.
`minimal` writes only what is needed for crash recovery on this same
server; some bulk operations under it may skip full WAL logging. `replica`
(the default) adds the data for streaming replication and archive recovery
(PITR), which is enough for physical replicas. `logical` writes even more:
enough to decode changes at the row level for logical replication and CDC.
The higher the level, the larger the WAL volume, so you raise it for
exactly the scenario you need.

Question 8

What is a ring buffer, and why is it used for large sequential operations?

Accepted Answer

If a large sequential scan were allowed to fill `shared_buffers`, it would
evict the entire hot working set. To prevent that, a sequential scan of a
table larger than a quarter of `shared_buffers`, bulk writes such as
`COPY`, and vacuum each get a small ring of buffers that they reuse. The
ring is on the order of a few hundred kilobytes for a sequential scan and
larger for bulk writes (exact sizes vary by version), so the operation
spins inside its ring and does not knock out other hot pages. A related but
separate mechanism is synchronized scans (`synchronize_seqscans`):
concurrent seq scans of one table align their start position so they read
nearby pages while those are still hot in the cache, which is why a scan
can start somewhere other than the beginning of the file.
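
The effect of the ring can be seen in a toy comparison. The shared cache
here is a plain LRU built on `OrderedDict`, standing in for clock sweep,
and `scan` and `hot_survivors` are hypothetical helpers, so this is a
sketch of the idea rather than of PostgreSQL's access strategies.

```python
from collections import OrderedDict


def scan(cache, capacity, blocks, ring_slots=None):
    """Read blocks through cache; with ring_slots set, the scan recycles
    only the slots it has claimed itself."""
    ring = []
    for b in blocks:
        if b in cache:
            cache.move_to_end(b)       # hit: refresh recency
            continue
        if ring_slots is not None and len(ring) == ring_slots:
            victim = ring.pop(0)       # reuse the scan's own oldest slot
            cache.pop(victim, None)
        elif len(cache) >= capacity:
            cache.popitem(last=False)  # evict the globally coldest page
        cache[b] = True
        if ring_slots is not None:
            ring.append(b)
    return cache


def hot_survivors(ring_slots):
    """Count hot pages still cached after a 100-block scan."""
    cache = OrderedDict((("hot", i), True) for i in range(6))
    big_table = [("big", i) for i in range(100)]
    scan(cache, capacity=8, blocks=big_table, ring_slots=ring_slots)
    return sum(1 for key in cache if key[0] == "hot")


print(hot_survivors(None), hot_survivors(2))  # prints: 0 6
```

Without the ring the scan pushes out all six hot pages; with a two-slot
ring it only ever overwrites its own two buffers.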
