Question 1

Why do you need MVCC? What does it buy over locking on read?

Accepted Answer

MVCC (multiversion concurrency control) keeps several versions of one row
at the same time. A reader sees a snapshot of the data as of the start of
the query or transaction and does not wait for writers, and a writer does
not wait for readers. The core rule: reads do not block writes, writes do
not block reads. The price is that old versions pile up as garbage and
have to be cleaned (that is vacuum's job). The alternative from older
databases, locking a row on read, produces less garbage but turns
concurrent load into a queue.

Question 2

How does PostgreSQL decide whether a row version is visible to the current transaction?

Accepted Answer

A version has `xmin` (who created it) and `xmax` (who deleted or locked
it). A transaction takes a snapshot: its own number, the boundary "every
transaction below this is finished", and the list of still-running
transactions. The version is visible if `xmin` committed and falls into the
snapshot's past, and `xmax` is either empty, only a row lock, or belongs to
a transaction that rolled back or had not committed as far as the snapshot
can tell. A transaction's
status (committed/rolled back) sits in the clog, but checking it every
time is expensive, so the first reader to look sets hint bits in
`t_infomask`, and from then on the answer comes from the row itself.
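
The rule above can be sketched as a toy model. This is not PostgreSQL's actual code: it ignores the transaction's own writes, command ids, row locks, and hint bits, and the names `Snapshot`, `TupleVersion`, `is_visible`, and the `committed` set (a stand-in for the clog) are invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class Snapshot:
    xmin: int                 # every xid below this had finished
    xmax: int                 # every xid at or above this counts as running
    running: FrozenSet[int] = frozenset()   # unfinished xids in [xmin, xmax)


@dataclass
class TupleVersion:
    xmin: int                    # xid that created this version
    xmax: Optional[int] = None   # xid that deleted it, None if nobody did


def in_snapshot_past(xid: int, snap: Snapshot, committed: Set[int]) -> bool:
    """True if xid committed before the snapshot was taken."""
    if xid >= snap.xmax or xid in snap.running:
        return False
    return xid in committed


def is_visible(tup: TupleVersion, snap: Snapshot, committed: Set[int]) -> bool:
    if not in_snapshot_past(tup.xmin, snap, committed):
        return False          # creator rolled back, or is not in our past
    if tup.xmax is None:
        return True           # never deleted
    # Deleted, but the deletion counts only if the deleter is in our past.
    return not in_snapshot_past(tup.xmax, snap, committed)


committed = {100, 101, 105}   # stand-in for the clog; 104 rolled back
snap = Snapshot(xmin=103, xmax=106, running=frozenset({103}))
```

With this snapshot, a version created by 100 and deleted by the still-running 103 stays visible, while one deleted by the committed 101 does not.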

Question 3

What is a snapshot physically? Is it a copy?

Accepted Answer

A snapshot is not a copy of the data but a small set of numbers: the
boundary below which all transactions are already finished (the snapshot
`xmin`), the boundary at or above which no transaction had finished yet
(the snapshot `xmax`), and an explicit list of the xids in between that
were still active when the snapshot was taken. The visibility of any row
version is computed from
these numbers on the fly. That is why a snapshot is cheap. You can take it
instantly and even export it to another session (`pg_export_snapshot`) so
that a parallel `pg_dump` reads a consistent picture.
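
To make "a small set of numbers" concrete: the server prints a snapshot as text in the form `xmin:xmax:xip_list`, for example `10:20:10,14,15`. Below is a minimal sketch that parses that text and answers whether a given xid had already finished when the snapshot was taken. The helper names `parse_snapshot` and `finished_before` are hypothetical, not server functions.

```python
from typing import List, Tuple


def parse_snapshot(text: str) -> Tuple[int, int, List[int]]:
    """Split 'xmin:xmax:xip_list' into its three parts."""
    xmin_s, xmax_s, xip_s = text.split(":")
    xip = [int(x) for x in xip_s.split(",")] if xip_s else []
    return int(xmin_s), int(xmax_s), xip


def finished_before(xid: int, snapshot_text: str) -> bool:
    """True if xid had finished (committed or rolled back) at snapshot time."""
    xmin, xmax, xip = parse_snapshot(snapshot_text)
    if xid < xmin:
        return True        # below the lower boundary: certainly finished
    if xid >= xmax:
        return False       # at or above the upper boundary: not finished yet
    return xid not in xip  # in between: finished unless listed as active
```

Note that "finished" here covers both outcomes; whether the transaction actually committed still has to come from the clog or the hint bits.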

Question 4

Which isolation levels does PostgreSQL have, and which anomaly does each cut off?

Accepted Answer

The standard describes four levels; PostgreSQL implements three
distinguishable ones: Read Committed (the default), Repeatable Read, and
Serializable. A requested Read Uncommitted behaves as Read Committed, and
dirty reads never happen here. Read Committed takes a new snapshot for
each statement, so non-repeatable reads and phantoms are possible.
Repeatable Read takes one snapshot for the whole transaction, so repeated
reads are stable and phantoms do not appear either, but a write anomaly
(write skew) is possible.
Serializable adds dependency tracking through predicate locks (SSI) and
guarantees a result as if transactions ran one after another.
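
The difference between a snapshot per statement and a snapshot per transaction can be shown with a toy store. This is a simplified sketch, not how the server is built: the snapshot is reduced to a single commit counter, and the classes `Store` and `Transaction` are invented for the example.

```python
class Store:
    """Keeps every committed value together with its commit sequence number."""

    def __init__(self):
        self.commit_seq = 0
        self.history = {}   # key -> list of (commit_seq, value), oldest first

    def commit_write(self, key, value):
        self.commit_seq += 1
        self.history.setdefault(key, []).append((self.commit_seq, value))

    def read_as_of(self, key, seq):
        value = None
        for s, v in self.history.get(key, []):
            if s <= seq:
                value = v   # latest version committed at or before seq
        return value


class Transaction:
    def __init__(self, store, level):
        self.store = store
        self.level = level      # "read committed" or "repeatable read"
        self.snapshot = None    # taken lazily, on the first statement

    def select(self, key):
        # Read Committed refreshes the snapshot for every statement;
        # Repeatable Read keeps the one taken by its first statement.
        if self.level == "read committed" or self.snapshot is None:
            self.snapshot = self.store.commit_seq
        return self.store.read_as_of(key, self.snapshot)


store = Store()
store.commit_write("balance", 100)
rc = Transaction(store, "read committed")
rr = Transaction(store, "repeatable read")
first = (rc.select("balance"), rr.select("balance"))
store.commit_write("balance", 50)     # a concurrent transaction commits
second = (rc.select("balance"), rr.select("balance"))
```

Here the Read Committed transaction sees the new balance on its second read (a non-repeatable read), while the Repeatable Read transaction keeps seeing the old one.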

Question 5

Repeatable Read versus Serializable: what exactly does SSI catch?

Accepted Answer

Repeatable Read gives a stable snapshot: inside a transaction the data
does not shift under your feet. But two concurrent transactions can diverge
on writes: each reads the same state, each writes a different row based on
what it read, and the combined result is impossible under any serial order.
That is write skew. Serializable adds
SSI (serializable snapshot isolation): the server tracks dangerous cycles
of read-write dependencies through predicate (SIRead) locks and rolls one
transaction back with a serialization error. The guarantee is a result
equivalent to some serial order.
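
A common illustration of write skew is an on-call roster where at least one doctor must stay on call. The sketch below simulates it with plain dictionaries. The two-way read/write overlap check at the end is a deliberately crude stand-in for what SSI tracks, not the server's actual algorithm, and all names are made up for the example.

```python
def go_off_call(snapshot, me):
    """Leave the roster only if the snapshot shows someone else stays on call.

    Returns the set of writes this transaction wants to make.
    """
    on_call = [doctor for doctor, on in snapshot.items() if on]
    if len(on_call) >= 2:
        return {me: False}
    return {}


db = {"alice": True, "bob": True}

# Snapshot isolation: both transactions read the same starting state.
snap1, snap2 = dict(db), dict(db)
w1 = go_off_call(snap1, "alice")
w2 = go_off_call(snap2, "bob")
# The write sets touch different rows, so neither update conflicts.
si_result = {**db, **w1, **w2}

# Serial execution: the second transaction sees the first one's write.
serial_result = dict(db)
serial_result.update(go_off_call(dict(serial_result), "alice"))
serial_result.update(go_off_call(dict(serial_result), "bob"))


def reads_what_other_writes(read_set, other_writes):
    return bool(set(read_set) & set(other_writes))


# Each transaction read a row the other one wrote: a two-way
# read-write dependency, the kind of pattern Serializable aborts on.
dangerous = (reads_what_other_writes(snap1, w2)
             and reads_what_other_writes(snap2, w1))
```

Under snapshot isolation both doctors end up off call, a state that neither serial order can produce; under Serializable one of the two transactions would be rolled back and retried.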

Question 6

Why is an UPDATE in PostgreSQL effectively a new row version? What does it cost?

Accepted Answer

An UPDATE does not edit the row in place: it marks the old version through
`xmax` and lays down a new version with a new `xmin`. The old one lives as
long as any snapshot can see it, then vacuum takes it. Two consequences
follow. First, bloat: heavy UPDATEs breed dead versions faster than they
are cleaned. Second, indexes: by default a new version needs new entries
in every index on the table. HOT update saves you here. If no indexed
column changed and the page has room, the new version stays in the same
page without touching the indexes.
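
The index cost described above can be written down as a small decision function. It is a sketch under the two conditions named in the answer only; the real server checks more than this, and the function name, parameters, and index names are hypothetical.

```python
from typing import Dict, Set


def new_index_entries(changed_columns: Set[str],
                      indexes: Dict[str, Set[str]],
                      page_free_bytes: int,
                      new_version_bytes: int) -> int:
    """How many index entries an UPDATE has to add in this toy model."""
    touches_indexed = any(changed_columns & cols for cols in indexes.values())
    fits_in_page = new_version_bytes <= page_free_bytes
    if not touches_indexed and fits_in_page:
        return 0                # HOT: new version stays in the same page
    return len(indexes)         # otherwise every index gets a new entry


indexes = {"users_pkey": {"id"}, "users_email_idx": {"email"}}
```

For example, changing only `last_seen` with room in the page adds no index entries, while the same change on a full page, or a change to `email`, adds one entry per index.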

Question 7

What are the clog and hint bits, and why are they needed?

Accepted Answer

A row version records only the ids of the creating and deleting
transactions, not their outcome. Whether the `xmin` transaction committed
or rolled back is held
by the clog (commit log, the `pg_xact` directory), two bits per
transaction. Checking the clog on every read is expensive, so the first
reader to determine the outcome sets hint bits in the row's `t_infomask`:
"xmin committed" or "rolled back". After that visibility is computed
without a clog trip. A side effect: the first SELECT after a bulk insert
"dirties" pages by setting hint bits and produces disk writes even though
the data did not change.
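
Two bits per transaction means four transaction statuses per byte. A minimal sketch of that packing, plus the hint-bit shortcut, might look like the following. The classes `Clog` and `HeapTuple` and the `lookups` counter are illustrative only, and the real on-disk layout has pages and segments that this model leaves out.

```python
IN_PROGRESS, COMMITTED, ABORTED = 0, 1, 2
XACTS_PER_BYTE = 4   # two bits each


class Clog:
    def __init__(self, max_xid):
        self.data = bytearray(max_xid // XACTS_PER_BYTE + 1)
        self.lookups = 0    # counts how often readers had to come here

    def set_status(self, xid, status):
        byte, slot = divmod(xid, XACTS_PER_BYTE)
        shift = slot * 2
        cleared = self.data[byte] & ~(0b11 << shift) & 0xFF
        self.data[byte] = cleared | (status << shift)

    def get_status(self, xid):
        self.lookups += 1
        byte, slot = divmod(xid, XACTS_PER_BYTE)
        return (self.data[byte] >> (slot * 2)) & 0b11


class HeapTuple:
    def __init__(self, xmin):
        self.xmin = xmin
        self.hint = None    # outcome of xmin once some reader has cached it


def xmin_committed(tup, clog):
    if tup.hint is None:
        status = clog.get_status(tup.xmin)
        if status == IN_PROGRESS:
            return False    # outcome unknown yet, nothing to cache
        tup.hint = status   # this write is what dirties the page
    return tup.hint == COMMITTED
```

The second call to `xmin_committed` on the same tuple answers from the cached hint and never reaches the clog, which is the whole point of the optimization.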

Question 8

How does a virtual xid differ from a real one, and what do subtransactions have to do with it?

Accepted Answer

A real transaction id (xid) is a scarce 32-bit resource, and it is a shame
to spend it on transactions that write nothing. So every transaction starts
with only a virtual xid (a pair: the backend number plus a local counter),
and the real one is assigned lazily, on the first write. Subtransactions
(a savepoint, a block with exception handling in PL/pgSQL) also get their
own xids once they write; their mapping to the parent is held by
`pg_subtrans`. A rollback to a savepoint marks the subtransaction's xid as
aborted, so its versions become invisible without touching the parent.
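
Lazy assignment of the real xid can be sketched in a few lines. This is a toy model with invented names (`Backend`, `Transaction`, a shared `xid_counter`), not the server's data structures, and it leaves subtransactions out.

```python
import itertools


class Transaction:
    def __init__(self, vxid, xid_counter):
        self.vxid = vxid            # (backend number, local counter)
        self.xid = None             # real xid, not assigned yet
        self._xid_counter = xid_counter

    def read(self):
        pass                        # reading never consumes a real xid

    def write(self):
        if self.xid is None:        # first write: take a real xid now
            self.xid = next(self._xid_counter)


class Backend:
    def __init__(self, backend_id, xid_counter):
        self.backend_id = backend_id
        self.local_counter = 0
        self._xid_counter = xid_counter

    def begin(self):
        self.local_counter += 1
        return Transaction((self.backend_id, self.local_counter),
                           self._xid_counter)


xid_counter = itertools.count(100)  # shared, cluster-wide in this model
backend = Backend(3, xid_counter)

reader = backend.begin()
reader.read()                       # still only a virtual xid

writer = backend.begin()
writer.write()
writer.write()                      # a second write reuses the same xid
```

The read-only transaction finishes without ever drawing from the shared counter, so only the writer uses up a real xid.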
