how/mvcc

MVCC: one row, two versions

An UPDATE in PostgreSQL never overwrites a row. It writes a new version and stamps the old one with the transaction that retired it. For a while the same row exists twice on the page, and two sessions can read two different truths. Here is how, and how VACUUM cleans up afterward.

The first surprising thing about PostgreSQL is that UPDATE does not change a row in place. It writes a new version of the row and leaves the old one on the page, marked as superseded. The same goes for DELETE (no new version, just a mark) and for every row an INSERT creates.

This is MVCC, multi-version concurrency control. Each version carries two transaction stamps in its header (see tuple-header):

  • xmin - the transaction that created this version;
  • xmax - the transaction that retired it (updated, deleted, or locked it). xmax = 0 means the version is still live.

A transaction decides what it may see through a snapshot (see snapshot): a frozen view of which transactions had committed when it started. The same bytes on disk look different to different snapshots, and that is exactly why readers never block writers. Press play to follow one row from a single version all the way to a vacuumed page.

step 1/5·00 · one row, one version
table mv · page 0v1 · ctid (0,1)xmin 786xmax 0note 'a'live · xmax 0a reading sessionsees the only versionSELECT note FROM mv;'a'readsone row, one version: xmin 786 created it, xmax 0 means live

§ steps

  1. Start with a single row in a table mv, inserted by transaction 786:

    sql
    SELECT xmin, xmax, ctid, * FROM mv;
    --  xmin | xmax | ctid  | id | note
    -- ------+------+-------+----+------
    --   786 |    0 | (0,1) |  1 | a

    One version, at ctid (0,1). Its xmin is 786 (the transaction that created it) and its xmax is 0, which means nobody has retired it: the row is live. This is what every reader sees.

recap

What to remember:

  • UPDATE is delete plus insert under the hood: a new version with a fresh xmin, and the old version stamped with xmax = <updating txid>. Both live on the page until cleanup. See why-mvcc for the design reason.
  • Visibility is computed, not stored. A version is visible if its xmin is committed and inside your snapshot, and its xmax is empty or not yet visible to you. The full rules are in xmin-xmax.
  • Two snapshots can legitimately read two different values of the same row at the same wall-clock moment. Nothing is corrupt; they are looking through different time frames.
  • A version no live snapshot can see anymore is a dead tuple. It still takes space on the page until VACUUM removes it and frees its line pointer (see vacuum).
  • VACUUM can only remove versions older than the oldest snapshot still running, the transaction horizon (see transaction-horizon). One long transaction holds the horizon back and lets dead tuples pile up across the whole database, which is where table bloat comes from.
  • When the update fits on the same page and no index covers the changed column, Postgres uses a HOT update (see hot-updates) to chain the versions without touching the indexes.

The payoff: readers never block writers and writers never block readers. The cost is dead tuples, and collecting them is the whole job of VACUUM.

§ dig into the knowledge base

§ try it hands-on