linuxlab.io
Tutorials▾
  • Linux & networking
    File system, processes, TCP/IP, BGP and OSPF
    →
  • Terraform & IaC
    HCL, state, plan/apply on a LocalStack sandbox
    →
  • Git & GitHub
    Object model, plumbing, branching, GitHub Actions
    →
  • PostgreSQL internals
    Page and tuple, MVCC, vacuum, WAL, the planner and indexes
    →
All tutorials →
PricingAboutSign inCreate account
/
Intro
Lessons
Footer
linuxlab-TutorialsPricingAboutPrivacy & cookies
Copyright © 2026 LinuxLab. All rights reserved.
linuxlab.io
Tutorials▾
  • Linux & networking
    File system, processes, TCP/IP, BGP and OSPF
    →
  • Terraform & IaC
    HCL, state, plan/apply on a LocalStack sandbox
    →
  • Git & GitHub
    Object model, plumbing, branching, GitHub Actions
    →
  • PostgreSQL internals
    Page and tuple, MVCC, vacuum, WAL, the planner and indexes
    →
All tutorials →
PricingAboutSign inCreate account
/
  • Introduction
  • Chapters
  • How it works
  • Lessons
  • Knowledge base
  • Interview prep
Cluster

Back to clusters

Snapshots, xmin/xmax, isolation levels

How PostgreSQL keeps readers and writers out of each other's way: row versions, the snapshot as a set of numbers, the clog and hint bits, and the four isolation levels with their anomalies. This is where candidates fail most often, because MVCC is easy to learn in words and hard to explain "how exactly".

8 questions · ~40 min read

Questions

On this page

  1. 01Why do you need MVCC? What does it buy over locking on read?
  2. 02How does PostgreSQL decide whether a row version is visible to the current transaction?
  3. 03What is a snapshot physically? Is it a copy?
  4. 04Which isolation levels does PostgreSQL have, and which anomaly does each cut off?
  5. 05Repeatable Read versus Serializable: what exactly does SSI catch?
  6. 06Why is an UPDATE in PostgreSQL effectively a new row version? What does it cost?
  7. 07What are the clog and hint bits, and why are they needed?
  8. 08How does a virtual xid differ from a real one, and what do subtransactions have to do with it?

#why-mvcc

juniorOften

Why do you need MVCC? What does it buy over locking on read?

What to say

MVCC (multiversion concurrency control) keeps several versions of one row at the same time. A reader sees a snapshot of the data as of the start of the query or transaction and does not wait for writers, and a writer does not wait for readers. The core rule: reads do not block writes, writes do not block reads. The price is that old versions pile up as garbage and have to be cleaned (that is vacuum's job). The alternative from older databases, locking a row on read, produces less garbage but turns concurrent load into a queue.

What they want to hear

A candidate should: - state the rule "readers do not block writers and the reverse" and understand that this is the whole point of multiversioning - name the flip side: every old version lives until cleanup, so MVCC and vacuum are one topic - tie a snapshot to a point in time: what a transaction sees depends on the isolation level and on when the snapshot was taken - not confuse MVCC with "no locks": row and relation locks are still there, MVCC only removes the read versus write conflict

Pitfalls

  • ✗ Saying "PostgreSQL has no locks thanks to MVCC". Locks exist. MVCC only removes the read versus write conflict
  • ✗ Forgetting the price: versions accumulate, and without vacuum the table bloats
  • ✗ Thinking a reader always sees the freshest data. It sees its snapshot, not the latest state

Follow-up

  • ? Why does a long-running reading transaction hinder garbage cleanup?
  • ? What still blocks in PostgreSQL despite MVCC?
  • ? Where does table bloat come from under an UPDATE-heavy load?

Depth in knowledge base

  • Why multiversioning
  • Data snapshot
  • xmin, xmax, and the visibility rule
tags: mvcc, concurrency, snapshotbook: postgresql_internals-17.pdf:part1 isolation and mvcc

#xmin-xmax-visibility

intermediateOften

How does PostgreSQL decide whether a row version is visible to the current transaction?

What to say

A version has `xmin` (who created it) and `xmax` (who deleted or locked it). A transaction takes a snapshot: its own number, the boundary "every transaction below this is finished", and the list of still-running transactions. The version is visible if `xmin` finished successfully and falls into the snapshot's past, and `xmax` is either empty or belongs to a transaction that has not finished or was rolled back. A transaction's status (committed/rolled back) sits in the clog, but checking it every time is expensive, so the first reader to look sets hint bits in `t_infomask`, and from then on the answer comes from the row itself.

What they want to hear

A senior should: - describe a snapshot as a triple: the snapshot `xmin`, the snapshot `xmax`, and the list of active xids, not as a copy of the data - run a version through the visibility rule: the creator is committed and in the past, the deleter is absent or invisible - explain the role of the clog and hint bits: the clog holds a transaction's outcome, hint bits cache it in the row to remove repeat lookups - understand that the same row can have several versions visible to different transactions at once

Pitfalls

  • ✗ Treating a snapshot as a copy of the table. It is a few numbers and a list of xids plus visibility rules
  • ✗ Thinking a transaction's status is stored in the row from the start. It is in the clog and reaches the row later through hint bits
  • ✗ Forgetting that a live row can carry an `xmax`. Someone simply locked it with `FOR UPDATE`

Follow-up

  • ? What does the snapshot seen by `pg_current_snapshot()` consist of?
  • ? Why do you need hint bits if the status is already in the clog?
  • ? Can a row with a non-empty `xmax` be visible? When?

Depth in knowledge base

  • xmin, xmax, and the visibility rule
  • Data snapshot
  • clog and hint bits
tags: mvcc, visibility, snapshotbook: postgresql_internals-17.pdf:ch4 snapshots

#snapshot-is-not-copy

intermediateOften

What is a snapshot physically? Is it a copy?

What to say

A snapshot is not a copy of the data but a small set of numbers: the boundary below which all transactions are already finished (the snapshot `xmin`), the boundary above which none have started yet (the snapshot `xmax`), and an explicit list of the xids that were active when the snapshot was taken. The visibility of any row version is computed from these numbers on the fly. That is why a snapshot is cheap. You can take it instantly and even export it to another session (`pg_export_snapshot`) so that a parallel `pg_dump` reads a consistent picture.

What they want to hear

A senior should: - debunk the naive model "a snapshot copies rows": it copies numbers, not data - explain how a specific version's visibility comes out of the snapshot - give snapshot export as a practice: parallel `pg_dump`, consistent reads across several backends - tie a snapshot to the horizon: while a snapshot is alive, versions in its past cannot count as garbage, and vacuum will not touch them

Pitfalls

  • ✗ Saying "a snapshot is an MVCC copy of the table". It is a few numbers plus a list of active xids
  • ✗ Thinking a snapshot is expensive. It is nearly free. Holding it for a long time is what costs
  • ✗ Not tying a snapshot to the cleanup horizon. A long-lived snapshot is exactly what holds garbage

Follow-up

  • ? Why does a parallel `pg_dump` need `pg_export_snapshot`?
  • ? Why does a long-lived snapshot defer cleanup?
  • ? How does a Read Committed snapshot differ from a Repeatable Read one?

Depth in knowledge base

  • Data snapshot
  • Exporting a snapshot between sessions
  • xmin, xmax, and the visibility rule
tags: mvcc, snapshot, isolationbook: postgresql_internals-17.pdf:ch4 snapshots

#isolation-levels-pg

intermediateOften

Which isolation levels does PostgreSQL have, and which anomaly does each cut off?

What to say

The standard describes four levels; PostgreSQL implements three distinguishable ones: Read Committed (the default), Repeatable Read, and Serializable. A requested Read Uncommitted behaves as Read Committed, and dirty reads never happen here. Read Committed takes a new snapshot for each statement, so non-repeatable reads and phantoms are possible. Repeatable Read takes one snapshot for the whole transaction, so repeated reads are stable, but a write anomaly (write skew) is possible. Serializable adds dependency tracking through predicate locks (SSI) and guarantees a result as if transactions ran one after another.

What they want to hear

A senior should: - name the three real levels and say that PostgreSQL's Read Uncommitted equals Read Committed - tie a level to snapshot timing: per statement (RC) or per transaction (RR/Serializable) - list the anomalies in order: non-repeatable read, phantom, write skew, and which level cuts off which - know that Serializable can roll a transaction back with a serialization error, and the application must be ready to retry it

Pitfalls

  • ✗ Saying "PostgreSQL has dirty reads at Read Uncommitted". It does not. The level collapses into Read Committed
  • ✗ Assuming Repeatable Read catches write skew. It does not. That needs Serializable
  • ✗ Forgetting the retry: Serializable throws `serialization_failure`, and the transaction must be repeated

Follow-up

  • ? How does a Read Committed snapshot differ from Repeatable Read in timing?
  • ? What is write skew, and why does Repeatable Read miss it?
  • ? How should an application react to a serialization error?

Depth in knowledge base

  • Isolation levels in PostgreSQL
  • Data snapshot
  • Serializable and SSI
tags: mvcc, isolation, serializablebook: postgresql_internals-17.pdf:ch2 isolation · postgresql_internals-17.pdf:ch4 snapshots

#repeatable-read-vs-serializable

seniorSometimes

Repeatable Read versus Serializable: what exactly does SSI catch?

What to say

Repeatable Read gives a stable snapshot: inside a transaction the data does not shift under your feet. But two such snapshots can diverge on writes: each transaction reads one state, both write, and the result is impossible under any serial order. That is write skew. Serializable adds SSI (serializable snapshot isolation): the server tracks dangerous cycles of read-write dependencies through predicate (SIRead) locks and rolls one transaction back with a serialization error. The guarantee is a result equivalent to some serial order.

What they want to hear

A senior should: - give the classic write skew (for example two people grab the last booking, each sees it free) and show why RR lets it through - explain SSI as dependency tracking, not as locking everything on read - know about the SIReadLock: light predicate locks that do not wait but only mark what was read - understand the price: Serializable produces false positives and needs retries, but it removes a whole class of races without manual locks

Pitfalls

  • ✗ Thinking Serializable locks rows on read. It tracks dependencies, it does not hold locks
  • ✗ Assuming Repeatable Read is enough for invariants across rows. Write skew will slip through
  • ✗ Forgetting the retry loop: without it Serializable just fails transactions with errors

Follow-up

  • ? Give an example of write skew that Serializable catches and Repeatable Read misses
  • ? What is a SIReadLock, and why does it cause no waits?
  • ? Why does Serializable sometimes roll a transaction back even when the rows did not overlap?

Depth in knowledge base

  • Serializable and SSI
  • Isolation levels in PostgreSQL
  • Lightweight, spin, and predicate locks
tags: mvcc, serializable, ssibook: postgresql_internals-17.pdf:part3 locks

#update-is-delete-insert

intermediateOften

Why is an UPDATE in PostgreSQL effectively a new row version? What does it cost?

What to say

An UPDATE does not edit the row in place: it marks the old version through `xmax` and lays down a new version with a new `xmin`. The old one lives as long as any snapshot can see it, then vacuum takes it. Two consequences follow. First, bloat: heavy UPDATEs breed dead versions faster than they are cleaned. Second, indexes: by default a new version needs new entries in every index on the table. HOT update saves you here. If no indexed column changed and the page has room, the new version stays in the same page without touching the indexes.

What they want to hear

A senior should: - describe an UPDATE as a delete-plus-insert at the version level, with `xmax` set on the old one - tie this to bloat and the load on vacuum - explain HOT: the condition (no indexed columns change, room in the page) and the benefit (no new index entries) - give the practice: a `fillfactor` below 100 leaves room for HOT, while frequent UPDATEs of indexed columns kill HOT

Pitfalls

  • ✗ Saying "an UPDATE changes the row in place". It creates a new version, and the old one stays until cleanup
  • ✗ Thinking HOT always works. It falls away if an indexed column changes or the page runs out of room
  • ✗ Ignoring `fillfactor`: on a table with heavy UPDATEs it directly drives the HOT ratio

Follow-up

  • ? Under what conditions does an UPDATE go through as HOT?
  • ? Why lower `fillfactor` for a table with frequent updates?
  • ? Why is an UPDATE of one unindexed column cheaper than an indexed one?

Depth in knowledge base

  • xmin, xmax, and the visibility rule
  • HOT updates and fillfactor
  • In-page cleanup (prune)
tags: mvcc, update, hotbook: postgresql_internals-17.pdf:ch5 hot updates

#clog-hint-bits

seniorSometimes

What are the clog and hint bits, and why are they needed?

What to say

A row version records only the id of the creating transaction, not its outcome. Whether the `xmin` transaction committed or rolled back is held by the clog (commit log, the `pg_xact` directory), two bits per transaction. Checking the clog on every read is expensive, so the first reader to determine the outcome sets hint bits in the row's `t_infomask`: "xmin committed" or "rolled back". After that visibility is computed without a clog trip. A side effect: the first SELECT after a bulk insert "dirties" pages by setting hint bits and produces disk writes even though the data did not change.

What they want to hear

A senior should: - separate the roles: the clog holds the transaction outcome, hint bits cache it in the row itself - explain why the first read after a big load suddenly writes to disk (setting hint bits dirties pages) - tie it to the performance of "cold" tables and to why a repeat run of the same query is faster - understand that the clog is also trimmed: after freezing, old entries are no longer needed and are removed

Pitfalls

  • ✗ Thinking a transaction's outcome is in the row from the start. At first it is only in the clog, and it reaches the row later
  • ✗ Being surprised by disk writes from a `SELECT`. Those are hint bits, not a data change
  • ✗ Not knowing the clog is bounded and trimmed by freezing. That is part of the point of freeze

Follow-up

  • ? Why does the first `SELECT` after a million-row `COPY` write to disk?
  • ? Where does the clog physically sit, and how many bits per transaction does it use?
  • ? How does freezing let the clog be trimmed?

Depth in knowledge base

  • clog and hint bits
  • xmin, xmax, and the visibility rule
  • Freezing and relfrozenxid
tags: mvcc, clog, hint-bitsbook: postgresql_internals-17.pdf:ch4 snapshots

#virtual-vs-real-xid

seniorRare

How does a virtual xid differ from a real one, and what do subtransactions have to do with it?

What to say

A real transaction id (xid) is a scarce 32-bit resource, and it is a shame to spend it on transactions that write nothing. So while a transaction only reads, it gets a virtual xid (a pair: the backend number plus a local counter), and the real one is assigned lazily, on the first write. Subtransactions (a savepoint, a block with exception handling in PL/pgSQL) also get their own xids; their mapping to the parent is held by `pg_subtrans`. A rollback to a savepoint marks the subtransaction's versions invisible without touching the parent.

What they want to hear

A senior should: - explain why a virtual xid exists: it saves the 32-bit counter and eases wraparound pressure from reading transactions - say that the real xid is handed out lazily on the first write - describe subtransactions: savepoints and exception blocks breed child xids, with the parent link in `pg_subtrans` - warn about the anti-pattern: thousands of savepoints or exceptions in a loop bloat `pg_subtrans` and hurt performance

Pitfalls

  • ✗ Thinking every transaction immediately consumes an xid. A reading one gets by with a virtual one
  • ✗ Not knowing that a `BEGIN ... EXCEPTION` block in PL/pgSQL is a subtransaction with its own xid
  • ✗ Ignoring the cost of mass savepoints. Overflowing the subtransaction cache hurts speed

Follow-up

  • ? When does a reading transaction get a real xid?
  • ? Why can a loop with an `EXCEPTION` on every iteration drop in speed?
  • ? What does `pg_subtrans` hold, and why?

Depth in knowledge base

  • Virtual and real xids
  • Subtransactions and savepoints
  • xmin, xmax, and the visibility rule
tags: mvcc, xid, subtransactionsbook: postgresql_internals-17.pdf:ch4 snapshots
Footer
linuxlab-TutorialsPricingAboutPrivacy & cookies
Copyright © 2026 LinuxLab. All rights reserved.