Back to clusters

Buffer cache, WAL, checkpoints

How PostgreSQL keeps hot pages in memory and still survives a sudden crash: the buffer cache and its eviction, the write-ahead log (WAL), full-page images, checkpoints, WAL levels. Here they check whether you understand the difference between "written to memory" and "guaranteed to survive a crash".

8 questions · ~35 min read

#buffer-cache-basics

juniorOften

How is the buffer cache built, and why have it if there is an OS file cache?

What to say

The buffer cache is an area in shared memory common to all backends, sized by `shared_buffers`, sliced into 8 KB slots. Every page read and write goes through it: a backend does not touch the file directly but asks the buffer manager for the page. If the page is already there, that is a hit and the disk is not touched. The OS cache also exists and works a layer below, but PostgreSQL's buffer cache knows about MVCC, dirty pages, and WAL, so it can guarantee the write-ahead rule and not flush to disk what has not yet been recorded in WAL.

What they want to hear

A senior should: - describe the cache as shared memory of 8 KB buffers through which all page access goes - explain why a separate cache over the OS cache: control of dirty pages and enforcing the WAL rule - understand double caching: one page can sit in both `shared_buffers` and the OS page cache - give a sizing guide (around a quarter of RAM as a start) and say why "more" is not always better

Pitfalls

  • Setting `shared_buffers` to nearly all memory. The OS cache is needed too, and checkpoints start writing in a flood
  • Thinking a backend reads the file directly. Everything goes through the buffer manager
  • Confusing a buffer cache hit with an OS cache hit. They are two different layers

Follow-up

  • ? Why is `shared_buffers` usually not set to 90% of memory?
  • ? How does `pg_buffercache` show what is in the cache right now?
  • ? What is double caching, and why is it bad?

Depth in knowledge base

tags: buffers, cache, memorybook: postgresql_internals-17.pdf:ch9 buffer cache

#buffer-eviction-clocksweep

intermediateSometimes

How does PostgreSQL choose which page to evict from the buffer cache?

What to say

Instead of classic LRU it uses clock sweep. Each buffer has a usage counter: on access it grows (up to a small ceiling), and a special pointer walks the ring and decrements the counter on each buffer. A buffer at zero and with no pin becomes the victim. If the victim is dirty (changed since it was read), it is written to disk first, but only after the matching WAL record is already there (the write-ahead rule). Hot pages get to raise their counter and survive a lap, while cold ones are evicted.

What they want to hear

A senior should: - describe clock sweep: a usage counter plus a pointer that decays it around the ring - explain why WAL is written before evicting a dirty page (the write-ahead rule) - tie it to performance: when the cache is short, backends themselves start evicting and writing, which is where the dips come from - mention the bgwriter, which writes dirty buffers out ahead of time so backends have to do it less often

Pitfalls

  • Calling the algorithm LRU. In PostgreSQL it is clock sweep with a usage counter
  • Forgetting the WAL rule: a dirty buffer cannot be flushed before its WAL record
  • Thinking only the bgwriter does eviction. Under pressure ordinary backends do it too

Follow-up

  • ? Why can a dirty buffer not be written before its WAL record?
  • ? What does the background bgwriter process do?
  • ? What does "a buffer is pinned" mean, and who pins it?

Depth in knowledge base

tags: buffers, eviction, bgwriterbook: postgresql_internals-17.pdf:ch9 buffer cache

#why-wal

juniorOften

Why do you need WAL, and what is the write-ahead rule?

What to say

WAL (write-ahead log) is a sequential log of all page changes. The rule is simple: the WAL record that a page changed reaches disk before the changed page itself does. So at commit it is enough to durably write WAL (one sequential fsync), and the dirty data pages can be flushed lazily later. If the server crashes, on startup it replays WAL from the last checkpoint and restores all confirmed changes. That way one sequential write gives you both durability (the D in ACID) and a fast commit with no random writes across the whole table.

What they want to hear

A senior should: - state the write-ahead rule: the WAL to disk before the data page - explain why this is fast: a sequential WAL write instead of random page writes at every commit - tie it to recovery: after a crash WAL is replayed from the checkpoint - distinguish the WAL fsync at commit (mandatory for durability) from the deferred flush of data pages

Pitfalls

  • Thinking the data pages are written to disk at commit. WAL is written, and the pages are flushed later
  • Confusing WAL with a query log or a replication journal. It is a log of physical page changes
  • Assuming `synchronous_commit=off` loses nothing. It risks the last transactions for speed

Follow-up

  • ? What happens on startup after a sudden power loss?
  • ? What does `synchronous_commit=off` risk?
  • ? Why is a sequential WAL write cheaper than flushing pages at every commit?

Depth in knowledge base

tags: wal, durability, recoverybook: postgresql_internals-17.pdf:ch10 wal

#wal-lsn-recovery

intermediateOften

What is an LSN, and how does crash recovery work?

What to say

An LSN (log sequence number) is a monotonic address of a position in WAL, essentially an offset into the WAL. Each page stores in its header the LSN of the last WAL record applied to it. During recovery the server takes the last checkpoint and replays WAL forward: for each record it compares its LSN with the page's LSN and applies only what the page has not yet seen (idempotency by LSN). Reaching the end of the WAL, the database lands in a consistent state with all confirmed transactions. The same LSNs serve as positions for streaming replication.

What they want to hear

A senior should: - define an LSN as a position in WAL and say it is also written into the page header - describe redo from the checkpoint and the comparison of the record's LSN with the page's LSN as protection against reapplying - tie LSN to replication: a standby pulls WAL by LSN and reports how far it applied - understand that recovery rolls changes forward, while rolling back unfinished transactions is handled by MVCC (their versions are simply invisible)

Pitfalls

  • Thinking recovery explicitly rolls back unfinished transactions. Their versions just stay invisible under MVCC
  • Assuming redo applies every record in a row. A record with an LSN below the page's LSN is skipped
  • Confusing an LSN with a transaction id. It is a position in the WAL, not an xid

Follow-up

  • ? How does a page's LSN protect against reapplying a WAL record?
  • ? From what point does WAL replay begin during recovery?
  • ? How is the LSN used in streaming replication?

Depth in knowledge base

tags: wal, lsn, recoverybook: postgresql_internals-17.pdf:ch10 wal

#full-page-images

seniorSometimes

What is a full-page image, and why does PostgreSQL write a whole page into WAL?

What to say

An 8 KB page is not written to disk atomically: on a crash during the write you can get a half page (a torn page), part old and part new. So that such a page can be recovered, on the first change after a checkpoint PostgreSQL writes its full copy into WAL, a full-page image (FPI). After that ordinary deltas follow until the next checkpoint resets the counter again. This is governed by `full_page_writes` (on by default). FPIs are the main reason WAL swells right after a checkpoint and why frequent checkpoints increase the WAL volume.

What they want to hear

A senior should: - explain the torn page problem: an 8 KB write is not atomic on a crash - say that an FPI is written on the first change to a page after a checkpoint and cures the partial write - tie checkpoint frequency to WAL volume: more checkpoints mean more FPIs mean more WAL - know when `full_page_writes` can be turned off (a filesystem or device with atomic page writes) and why it is risky

Pitfalls

  • Turning off `full_page_writes` without a guarantee of atomic page writes from storage. A direct risk of corruption on a crash
  • Being surprised by a WAL volume spike right after a checkpoint. That is the FPIs
  • Making checkpoints very frequent for faster recovery, forgetting that they swell WAL through FPIs

Follow-up

  • ? What is a torn page, and why is it possible?
  • ? Why does the WAL volume jump right after a checkpoint?
  • ? Under what conditions is it safe to turn off `full_page_writes`?

Depth in knowledge base

tags: wal, fpi, durabilitybook: postgresql_internals-17.pdf:ch10 wal

#checkpoints

intermediateOften

What does a checkpoint do, and how does its tuning affect load?

What to say

A checkpoint flushes to disk all dirty buffers accumulated up to some LSN and writes a mark into WAL: "everything up to this position is already in the data files". This shortens the WAL that will have to be replayed on recovery. It triggers by time (`checkpoint_timeout`) or by WAL volume (`max_wal_size`). To avoid a write spike, the flush is spread over time by `checkpoint_completion_target`. Too-frequent checkpoints swell WAL through FPIs and load the disk; too-rare ones lengthen recovery and pile up dirty buffers. You balance between recovery speed and a smooth write.

What they want to hear

A senior should: - describe a checkpoint as flushing dirty buffers plus a WAL mark that sets the start of future recovery - name both triggers: time and WAL volume - explain the trade-off: frequent checkpoints mean more FPIs and writes, rare ones mean longer recovery - know about spreading the write (`checkpoint_completion_target`) and the diagnostics through the checkpoint log and `pg_stat_bgwriter`

Pitfalls

  • Setting a very small `max_wal_size`. Checkpoints get frequent and flood the disk through FPIs
  • Ignoring `checkpoint_completion_target`. The flush bunches into a spike and causes a dip
  • Confusing a checkpoint with vacuum. A checkpoint is about durability and recovery, not garbage

Follow-up

  • ? Why are too-frequent checkpoints harmful?
  • ? Why spread the flush through `checkpoint_completion_target`?
  • ? How do you tell from the logs that checkpoints run by volume rather than the timer?

Depth in knowledge base

tags: wal, checkpoint, tuningbook: postgresql_internals-17.pdf:ch11 wal modes

#wal-levels

intermediateSometimes

What WAL levels are there, and why raise them?

What to say

The WAL level sets how much information is written to the WAL. `minimal` writes only what is needed for crash recovery on this same server; some bulk operations under it may skip full WAL logging. `replica` (the default) adds the data for streaming replication and archive recovery (PITR), which is enough for physical replicas. `logical` writes even more: enough to decode changes at the row level for logical replication and CDC. The higher the level, the larger the WAL volume, so you raise it for exactly the scenario you need.

What they want to hear

A senior should: - list the minimal/replica/logical levels and what each one unlocks - tie `replica` to physical replicas and PITR, and `logical` to logical replication and change data capture - understand the cost: WAL volume grows as the level rises - know that changing the level requires a restart and is planned ahead against the replication plan

Pitfalls

  • Setting `logical` without need. Extra WAL volume for no benefit
  • Expecting a streaming replica on `minimal`. Its data is not enough for a standby
  • Thinking the level changes on the fly. It needs a server restart

Follow-up

  • ? Which level is needed for a streaming replica, and which for a logical one?
  • ? Why can `minimal` not be used for a standby?
  • ? What exactly does the `logical` level add to the WAL?

Depth in knowledge base

tags: wal, wal-level, replicationbook: postgresql_internals-17.pdf:ch11 wal modes

#ring-buffer-seqscan

seniorRare

What is a ring buffer, and why is it used for large sequential operations?

What to say

If you let a large `SELECT` over a table bigger than the cache fill `shared_buffers`, it would evict the entire hot working set. To prevent that, large sequential scans, `COPY`, and vacuum get a small ring of buffers: the operation spins inside a few hundred kilobytes and does not knock out other hot pages. A related but separate mechanism is synchronized scans (`synchronize_seqscans`): concurrent seq scans of one table align their start position so they read nearby pages while those are still hot in the cache, which is why a scan can start somewhere other than the beginning of the file.

What they want to hear

A senior should: - explain the ring's goal: keep a large operation from washing the hot set out of the cache - name the operations that get a ring: large seq scans, COPY, vacuum - describe scan synchronization as a separate mechanism: concurrent scans align their start position to reuse hot pages - understand the ring is a trade-off: less harm to the cache at the cost of possible repeat reads

Pitfalls

  • Thinking any query freely fills the whole cache. Large sequential operations are bounded by the ring
  • Being surprised a parallel scan starts in the middle of a table. That is scan synchronization
  • Confusing the ring with the shared cache. It is a small area dedicated to one operation

Follow-up

  • ? Which operations does PostgreSQL give a ring buffer?
  • ? Why do two simultaneous scans of one table save reads?
  • ? What is the downside of the ring compared with free use of the cache?

Depth in knowledge base

tags: buffers, ring-buffer, seqscanbook: postgresql_internals-17.pdf:ch9 buffer cache