Back to clusters

Page, tuple, TOAST, relation files

How PostgreSQL lays data out on disk: the 8 KB page, the tuple header, line pointers, TOAST for long values, field alignment, and relation files. This is the foundation. Without it, any talk of MVCC, vacuum, and indexes hangs in the air.

7 questions · ~35 min read

#page-8kb-anatomy

juniorOften

What lives inside an 8 KB page? Name the parts and which way each one grows.

What to say

A page is the unit of reads and writes, 8 KB by default. It has four zones. The header (`PageHeaderData`, 24 bytes): checksum, the LSN of the last WAL record, and the `pd_lower` and `pd_upper` pointers. The array of line pointers grows down from the start. The tuples themselves are placed from the end upward. The special space at the very end holds service data for indexes and is empty in a heap. Free space is the gap between `pd_lower` and `pd_upper`; once that gap collapses, nothing more fits in the page.

What they want to hear

A senior should: - name the four zones and say that pointers and tuples grow toward each other, with free space as the gap between them - explain why the pointer layer exists: a tuple can move within a page (after cleanup), yet its external address `(page, item)` stays the same - know that 8 KB is a compile-time setting (`BLCKSZ`) and cannot change without a rebuild - mention `pageinspect` (`page_header`, `heap_page_items`) as the way to see all of this on a live page

Pitfalls

  • Saying "a tuple sits at a fixed offset". The offset is held by the line pointer, and the tuple itself can move
  • Confusing in-page free space with the FSM. The FSM is a separate map, while the `pd_lower..pd_upper` gap is space in one specific page
  • Thinking 8 KB can be changed in the config. It is the build setting `BLCKSZ`, not a GUC

Follow-up

  • ? What does `SELECT * FROM page_header(get_raw_page('t', 0))` show?
  • ? Why does `pd_lower` grow on INSERT while tuples come from the other end?
  • ? Why does a page carry a checksum, and when is it verified?

Depth in knowledge base

tags: storage, page, heapbook: postgresql_internals-17.pdf:ch1 data organization · postgresql_internals-17.pdf:ch3 pages and tuples

#tuple-header-fields

intermediateOften

What does a tuple header store? List the fields and what each is for.

What to say

Ahead of the user data, every tuple carries a `HeapTupleHeader`, 23 bytes plus alignment. The main fields: `t_xmin`, the id of the transaction that created the version; `t_xmax`, the id of the transaction that deleted or locked it (0 if alive); `t_ctid`, a pointer to the next version of this row (for the UPDATE chain) or to itself; `t_infomask` and `t_infomask2`, status bits (whether xmin/xmax committed, whether there are NULLs, HOT, and so on); and `t_hoff`, the offset where the header with its NULL bitmap ends and the data begins.

What they want to hear

A senior should: - tie `t_xmin`/`t_xmax` to visibility: these plus the snapshot decide whether the current transaction sees the version - explain hint bits in `t_infomask`: they cache "xmin committed" and save a repeat trip to the clog - say that `t_ctid` points at the next version, building the chain of versions on UPDATE - know that NULLs are stored not as a value but as a bitmap right after the fixed part of the header, which is why `t_hoff` floats

Pitfalls

  • Saying "xmax means deletion" and forgetting that xmax is also set when a row is locked by `SELECT FOR UPDATE`
  • Thinking a NULL takes up value space. It is marked by a bit in the null bitmap, with no data behind it
  • Treating the header as fixed. Its length `t_hoff` depends on the presence of a null bitmap and an OID

Follow-up

  • ? What in `t_infomask` distinguishes a committed transaction from one still running?
  • ? Where does `t_ctid` point for the most recent version of a row?
  • ? Why do two different INSERTs into one row yield a different `t_hoff`?

Depth in knowledge base

tags: storage, tuple, mvccbook: postgresql_internals-17.pdf:ch3 pages and tuples

#ctid-stability

intermediateSometimes

What is ctid, and why can't you treat it as a stable row identifier?

What to say

`ctid` is the physical address of a tuple: a pair of `(page number, item number)`. It tells you exactly where a version currently sits, and it is handy within a single query. But any UPDATE creates a new version with a new `ctid`, while the old one stays until cleanup. After `VACUUM FULL`, `CLUSTER`, or even ordinary cleanup with defragmentation, addresses move around. So you cannot store `ctid` in the application as a row key. The primary key is for that.

What they want to hear

A senior should: - state clearly that `ctid` addresses a version, not a logical row, and changes on every UPDATE - give the legitimate uses: deduplication by `ctid` within a query, pinpoint deletion of duplicates, manual "where is the row physically" diagnostics - explain why `WHERE ctid = ...` across transactions is unsafe: cleanup and moves make the address invalid - tie it to HOT: a HOT chain lives within a page, and an old version's `ctid` leads to the new one through `t_ctid`

Pitfalls

  • Caching `ctid` in the application and updating by it later. After a vacuum you hit the wrong row or nothing
  • Assuming `ctid` is monotonic or reflects insert order. It is not. Cleanup reuses pointer slots
  • Confusing `ctid` (a version address) with `oid` or the primary key (a logical identifier)

Follow-up

  • ? How do you delete duplicate rows, keeping one of each, using `ctid`?
  • ? Why do all `ctid` values in a table change after `VACUUM FULL`?
  • ? How does `ctid` differ from the system `oid`?

Depth in knowledge base

tags: storage, ctid, mvccbook: postgresql_internals-17.pdf:ch3 pages and tuples · postgresql_internals-17.pdf:ch5 hot updates

#toast-mechanics

intermediateOften

What is TOAST, when does it kick in, and what are its storage strategies?

What to say

A tuple has to fit in an 8 KB page, yet values can be longer. TOAST (The Oversized-Attribute Storage Technique) moves long fields out of line: it first tries to compress, and if the value is still large, it slices it into chunks stored in a service TOAST table, leaving a pointer in the row. The threshold is about 2 KB per row (`TOAST_TUPLE_THRESHOLD`). The per-column strategies: `plain` (leave alone, only for short types), `extended` (compress and move out when needed, the default for `text`/`jsonb`), `external` (move out without compression), and `main` (compress, move out as a last resort).

What they want to hear

A senior should: - explain the two independent steps, compression and moving out of line, and that the strategy controls both - understand why `external` helps substring and prefix lookups: without compression you can read chunks without decompressing the whole value - know that each table has its own TOAST table, which is also subject to vacuum and its own bloat - tie it to performance: a long field is not read from TOAST until a SELECT touches it (deferred de-toasting)

Pitfalls

  • Thinking TOAST is turned on by hand. It works on its own as soon as a row does not fit in a page
  • Forgetting that frequent UPDATEs of a large jsonb create garbage in both the main table and the TOAST table
  • Assuming `SELECT count(*)` reads TOAST. Long values are not de-toasted until you ask for them

Follow-up

  • ? How does the `external` strategy differ from `extended` for a `text` column?
  • ? Where does the TOAST table for a table named `documents` physically live?
  • ? Why does a long `jsonb` under frequent UPDATEs bloat twice over?

Depth in knowledge base

tags: storage, toast, jsonbbook: postgresql_internals-17.pdf:ch1 data organization

#column-order-padding

seniorSometimes

Why does column order affect the on-disk size of a row?

What to say

Fixed-length fields in a tuple are aligned to their own boundary: `bigint` and `double` to 8 bytes, `int` to 4, `smallint` to 2. If a `bigint` follows a `boolean` (1 byte) directly, 7 bytes of padding are inserted so the `bigint` lands on an address divisible by 8. Group the wide fields up front and the narrow ones (`bool`, `smallint`) at the tail, and there are fewer holes and the row takes fewer bytes. On a table with hundreds of millions of rows that is real gigabytes and extra pages to read.

What they want to hear

A senior should: - explain the cause: CPU alignment of types, not a PostgreSQL quirk - give the practical rule: order columns by descending alignment width, with variable-length ones (`text`, `numeric`) at the end - measure the effect through `pageinspect` (`lp_len`) or `pg_column_size`, not by eye - understand the win is not only disk: fewer bytes per row means more rows per page, which means fewer pages in the buffer cache and in reads

Pitfalls

  • Treating padding as a micro-optimization. On large tables it is tens of percent of the size
  • Thinking that reordering columns in an existing table shrinks it. You need a rewrite or `VACUUM FULL`/`CLUSTER`
  • Forgetting about `NUMERIC` and `text`: their alignment is 4 or 1, and the rule for them is different

Follow-up

  • ? How do you measure one row's size before and after reordering columns?
  • ? Why is `(int, bigint, int)` heavier than `(bigint, int, int)`?
  • ? Does reordering columns change anything for rows already written?

Depth in knowledge base

tags: storage, alignment, paddingbook: postgresql_internals-17.pdf:ch3 pages and tuples

#relfilenode-forks

intermediateSometimes

What are the forks of a relation, and how does relfilenode differ from oid?

What to say

On disk a relation is not a single file but several layers (forks). The main fork holds the data pages themselves. The FSM (free space map) tracks free space per page. The VM (visibility map) is a bitmap of "all versions in the page are visible to everyone" and "all are frozen". The init fork is an empty template for unlogged tables. The file names come from `relfilenode`, not `oid`: `oid` is the object's stable identifier in the catalog, while `relfilenode` is the name of the current set of files. Commands like `TRUNCATE`, `VACUUM FULL`, and `REINDEX` change `relfilenode` while leaving `oid` the same.

What they want to hear

A senior should: - name the main/fsm/vm/init forks and say that the fsm and vm appear not right away but after the first cleanup - distinguish `oid` (logical, in `pg_class`) from `relfilenode` (physical, the file name), and give the operations that desync them - know about segmentation: a fork's file is split into 1 GB pieces (`.1`, `.2`, ...) - be able to find the file: `pg_relation_filepath('t')` returns the path from the cluster directory

Pitfalls

  • Assuming a table's file name equals its `oid`. After `VACUUM FULL` that is no longer true
  • Thinking the FSM and VM always exist. A freshly created table has neither until its first vacuum
  • Forgetting about 1 GB segments and looking for one table in one file

Follow-up

  • ? What does `pg_relation_filepath('orders')` return, and what is the path made of?
  • ? Why does a new table have no `_fsm` file, while one appears after `VACUUM`?
  • ? Which operations change `relfilenode` without touching `oid`?

Depth in knowledge base

tags: storage, forks, relfilenodebook: postgresql_internals-17.pdf:ch1 data organization

#where-table-on-disk

juniorSometimes

Where does a table physically live in the cluster directory, and how do you find it?

What to say

Inside `PGDATA` the main data sits in `base/<database oid>/<relfilenode>`. Each database is its own subdirectory by `oid`, and each relation is files by `relfilenode`. A large fork is sliced into 1 GB segments. Tablespaces move specific objects to another disk: then instead of `base/` a directory under `pg_tblspc/` is used. The easiest way to find the path is `pg_relation_filepath('name')`, and sizes come from `pg_relation_size` and `pg_total_relation_size` (the latter counts indexes and TOAST too).

What they want to hear

A senior should: - describe the `base/<db>/<relfilenode>` hierarchy and the role of tablespaces - distinguish `pg_relation_size` (main fork only), `pg_table_size` (with TOAST and fsm/vm), and `pg_total_relation_size` (plus indexes) - understand that one relation is many files: segments plus forks - not poke at cluster files by hand on a running server. Reads and writes go through the buffer cache, and direct edits wreck integrity

Pitfalls

  • Assuming one table is one file. It is a set of segments and forks
  • Confusing the database `oid` (a directory) with the relation `relfilenode` (a file)
  • Measuring with `pg_relation_size` and being surprised it is smaller than reality. It excludes indexes and TOAST

Follow-up

  • ? How does `pg_total_relation_size` differ from `pg_relation_size`?
  • ? Where do the files move if you relocate a table to another tablespace?
  • ? Why can't you edit files in `base/` on a live server?

Depth in knowledge base

tags: storage, pgdata, tablespacebook: postgresql_internals-17.pdf:ch1 data organization