Question 1

What lives inside an 8 KB page? Name the parts and which way each one grows.

Accepted Answer

A page is the unit of reads and writes, 8 KB by default. It has four
zones. The header (`PageHeaderData`, 24 bytes) holds the checksum, the LSN
of the last WAL record that changed the page, and the `pd_lower` and
`pd_upper` pointers. The array of line pointers starts right after the
header and grows toward the end of the page. The tuples themselves are
placed from the end of the page back toward the header. The special space
at the very end holds service data for indexes and is empty in a heap.
Free space is the gap between `pd_lower` and `pd_upper`; once that gap is
too small for a new tuple plus its line pointer, nothing more fits in the
page.
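
The two growth directions can be sketched with a toy model. This is only an illustration: `ToyPage` and its methods are hypothetical names, 8-byte maximum alignment is assumed (typical for 64-bit builds), and the real free-space accounting in PostgreSQL has more details than this.

```python
import struct

PAGE_SIZE = 8192
# PageHeaderData without padding: pd_lsn (two uint32), then pd_checksum,
# pd_flags, pd_lower, pd_upper, pd_special, pd_pagesize_version (uint16 each),
# then pd_prune_xid (uint32).
HEADER_FORMAT = "<IIHHHHHHI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes
LINE_POINTER_SIZE = 4  # one line pointer entry


def maxalign(length, boundary=8):
    """Round length up to the next multiple of boundary (assumed 8)."""
    return (length + boundary - 1) // boundary * boundary


class ToyPage:
    """Hypothetical model that tracks only pd_lower, pd_upper, pd_special."""

    def __init__(self, special_size=0):
        self.pd_lower = HEADER_SIZE                 # end of the line pointer array
        self.pd_special = PAGE_SIZE - special_size  # start of the special space
        self.pd_upper = self.pd_special             # start of the tuple area

    def free_space(self):
        return self.pd_upper - self.pd_lower

    def add_tuple(self, tuple_length):
        """Reserve a line pointer and an aligned tuple; False if it does not fit."""
        aligned = maxalign(tuple_length)
        if LINE_POINTER_SIZE + aligned > self.free_space():
            return False
        self.pd_lower += LINE_POINTER_SIZE  # pointer array grows toward the end
        self.pd_upper -= aligned            # tuples grow toward the header
        return True


page = ToyPage()
stored = 0
while page.add_tuple(100):
    stored += 1
print(stored, page.free_space())
```

In this model an empty heap page starts with 8168 free bytes, and each 100-byte tuple costs 108 bytes: 104 after alignment plus 4 for its line pointer.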

Question 2

What does a tuple header store? List the fields and what each is for.

Accepted Answer

Ahead of the user data, every tuple carries a `HeapTupleHeader`, 23 bytes
plus alignment. The main fields: `t_xmin`, the id of the transaction that
created the version; `t_xmax`, the id of the transaction that deleted or
locked it (0 if none has); `t_ctid`, a pointer to the next version of this
row (for the UPDATE chain) or to itself; `t_infomask` and `t_infomask2`,
status bits (whether xmin/xmax committed, whether there are NULLs, HOT,
and so on); and `t_hoff`, the offset where the header with its NULL bitmap
ends and the data begins.
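
How `t_hoff` follows from the 23-byte header and the NULL bitmap can be sketched as below. The function name is hypothetical and 8-byte maximum alignment is assumed; the bitmap is taken as one bit per column, present only when the tuple has at least one NULL.

```python
FIXED_HEADER_BYTES = 23  # header fields that precede the NULL bitmap


def tuple_header_offset(column_count, has_nulls, boundary=8):
    """Estimate t_hoff: fixed header plus optional NULL bitmap, rounded up."""
    size = FIXED_HEADER_BYTES
    if has_nulls:
        size += (column_count + 7) // 8  # one bit per column, in whole bytes
    return (size + boundary - 1) // boundary * boundary


print(tuple_header_offset(3, has_nulls=False))
print(tuple_header_offset(8, has_nulls=True))
print(tuple_header_offset(9, has_nulls=True))
```

Under these assumptions a bitmap for up to 8 columns fits in the alignment padding, while the ninth column pushes the data start from 24 to 32 bytes.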

Question 3

What is ctid, and why can't you treat it as a stable row identifier?

Accepted Answer

`ctid` is the physical address of a tuple: a pair of `(page number, item
number)`. It tells you exactly where a version currently sits, and it is
handy within a single query. But any UPDATE creates a new version with a
new `ctid`, while the old one stays until cleanup. `VACUUM FULL` and
`CLUSTER` rewrite the table and hand out new addresses, and even ordinary
cleanup frees the line pointers of dead versions for reuse, so an old
`ctid` can end up pointing at a different row. So you cannot store `ctid`
in the application as a row key. The primary key is for that.
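
A toy model makes the problem concrete. Everything here is hypothetical (`ToyHeap`, a single page, no real visibility rules); it only shows that an UPDATE leaves the saved address pointing at a dead version while the key still finds the live one.

```python
class ToyHeap:
    """Hypothetical single-page heap: ctid -> [key, value, is_live]."""

    def __init__(self):
        self.versions = {}
        self.next_item = 1

    def insert(self, key, value):
        ctid = (0, self.next_item)  # (page number, item number)
        self.next_item += 1
        self.versions[ctid] = [key, value, True]
        return ctid

    def current_ctid(self, key):
        """Find the live version by key, the way a primary key lookup would."""
        for ctid, (row_key, _, is_live) in self.versions.items():
            if row_key == key and is_live:
                return ctid
        raise KeyError(key)

    def update(self, key, value):
        old_ctid = self.current_ctid(key)
        self.versions[old_ctid][2] = False  # old version stays until cleanup
        return self.insert(key, value)      # new version, new address


heap = ToyHeap()
saved_ctid = heap.insert(key=42, value="draft")
heap.update(key=42, value="final")
print(saved_ctid, heap.current_ctid(42))
```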

Question 4

What is TOAST, when does it kick in, and what are its storage strategies?

Accepted Answer

A tuple has to fit in an 8 KB page, yet values can be longer. TOAST (The
Oversized-Attribute Storage Technique) moves long fields out of line: it
first tries to compress, and if the value is still large, it slices it
into chunks stored in a service TOAST table, leaving a pointer in the row.
The threshold is about 2 KB per row (`TOAST_TUPLE_THRESHOLD`). The
per-column strategies: `plain` (leave alone, only for short types),
`extended` (compress and move out when needed, the default for
`text`/`jsonb`), `external` (move out without compression), and `main`
(compress, move out as a last resort).
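
The strategies can be compared with a simplified per-value sketch. Several things are assumptions made for illustration: the real threshold applies to the whole row rather than a single value, `zlib` stands in for the server's compressor, the 2000-byte threshold is a rounding of "about 2 KB", the 1996-byte chunk size is an assumed default, and the last-resort move for `main` is not modeled. `toast_plan` is a hypothetical name.

```python
import math
import zlib

TOAST_TUPLE_THRESHOLD = 2000  # "about 2 KB", rounded for illustration
TOAST_MAX_CHUNK_SIZE = 1996   # assumed payload bytes per TOAST chunk
STRATEGIES = {"plain", "extended", "external", "main"}


def toast_plan(value, strategy):
    """Return (placement, stored_bytes, chunk_count) for one value."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    size = len(value)
    if strategy == "plain" or size <= TOAST_TUPLE_THRESHOLD:
        return ("inline", size, 0)
    if strategy in ("extended", "main"):
        size = len(zlib.compress(value))  # stand-in for the real compressor
        if size <= TOAST_TUPLE_THRESHOLD or strategy == "main":
            return ("inline_compressed", size, 0)
    # extended (still large after compression) and external go out of line
    chunks = math.ceil(size / TOAST_MAX_CHUNK_SIZE)
    return ("out_of_line", size, chunks)


print(toast_plan(b"x" * 100, "extended"))
print(toast_plan(b"a" * 10000, "external"))
print(toast_plan(b"a" * 10000, "extended")[0])
```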

Question 5

Why does column order affect the on-disk size of a row?

Accepted Answer

Fixed-length fields in a tuple are aligned to their own boundary:
`bigint` and `double precision` to 8 bytes, `int` to 4, `smallint` to 2. If a
`bigint` follows a `boolean` (1 byte) directly, 7 bytes of padding are
inserted so the `bigint` lands on an address divisible by 8. Group the
wide fields up front and the narrow ones (`bool`, `smallint`) at the tail,
and there are fewer holes and the row takes fewer bytes. On a table with
hundreds of millions of rows that is real gigabytes and extra pages to
read.
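
The padding arithmetic is easy to check with a small calculator. This is a sketch that covers only the fixed-length types named above and ignores the tuple header and variable-length columns; `data_size` is a hypothetical helper.

```python
# (size, alignment) in bytes for the fixed-length types discussed above.
TYPE_LAYOUT = {
    "bigint": (8, 8),
    "double precision": (8, 8),
    "int": (4, 4),
    "smallint": (2, 2),
    "boolean": (1, 1),
}


def data_size(columns):
    """Bytes used by the column data, including alignment padding."""
    offset = 0
    for type_name in columns:
        size, align = TYPE_LAYOUT[type_name]
        offset = (offset + align - 1) // align * align  # pad up to the boundary
        offset += size
    return offset


mixed = ["boolean", "bigint", "smallint", "bigint", "boolean", "int"]
# Widest alignment first, narrow columns at the tail.
packed = sorted(mixed, key=lambda name: TYPE_LAYOUT[name][1], reverse=True)

print(data_size(mixed), data_size(packed))
```

For this column set the mixed order needs 40 bytes of data area and the packed order 24, a 16-byte difference per row.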

Question 6

What are the forks of a relation, and how does relfilenode differ from oid?

Accepted Answer

On disk a relation is not a single file but several layers (forks). The
main fork holds the data pages themselves. The FSM (free space map) tracks
free space per page. The VM (visibility map) is a bitmap of "all versions
in the page are visible to everyone" and "all are frozen". The init fork
is an empty template for unlogged tables. The file names come from
`relfilenode`, not `oid`: `oid` is the object's stable identifier in the
catalog, while `relfilenode` is the name of the current set of files.
Commands like `TRUNCATE`, `VACUUM FULL`, and `REINDEX` change
`relfilenode` while leaving `oid` the same.
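
As a rough sketch, the fork file names and the split between `oid` and `relfilenode` look like this. `ToyRelation` and the numeric values are hypothetical; the `_fsm`, `_vm`, and `_init` suffixes are the ones PostgreSQL appends to the main fork's file name.

```python
from dataclasses import dataclass

FORK_SUFFIXES = {"main": "", "fsm": "_fsm", "vm": "_vm", "init": "_init"}


def fork_file_names(relfilenode):
    """Map each fork to its file name for the given relfilenode."""
    return {fork: f"{relfilenode}{suffix}" for fork, suffix in FORK_SUFFIXES.items()}


@dataclass
class ToyRelation:
    """Hypothetical catalog entry: stable oid, replaceable relfilenode."""
    oid: int
    relfilenode: int

    def rewrite(self, new_relfilenode):
        # Models a rewriting command such as TRUNCATE or VACUUM FULL:
        # a new set of files, the same catalog identity.
        self.relfilenode = new_relfilenode


relation = ToyRelation(oid=16384, relfilenode=16384)
relation.rewrite(new_relfilenode=16390)
print(relation.oid, fork_file_names(relation.relfilenode))
```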

Question 7

Where does a table physically live in the cluster directory, and how do you find it?

Accepted Answer

Inside `PGDATA` the main data sits in `base/<database oid>/<relfilenode>`. Each database is its own subdirectory named by its `oid`, and each relation is a set of files named by its `relfilenode`. A large fork is sliced into 1 GB segments (the first file has no suffix, the following ones get `.1`, `.2`, and so on). Tablespaces move specific objects to another disk: then instead of `base/` a directory under `pg_tblspc/` is used. The easiest way to find the path is `pg_relation_filepath('name')`, and sizes come from `pg_relation_size` and `pg_total_relation_size` (the latter counts indexes and TOAST too).
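
The naming scheme for the default tablespace can be sketched as follows. The helper names and the OID values are hypothetical, and the default 8 KB block and 1 GB segment sizes are assumed; on a live server `pg_relation_filepath` remains the authoritative answer.

```python
SEGMENT_SIZE = 1024 ** 3  # 1 GB per segment file
BLOCK_SIZE = 8192         # default page size


def segment_paths(database_oid, relfilenode, fork_bytes):
    """Paths relative to PGDATA for the main fork of one relation."""
    base = f"base/{database_oid}/{relfilenode}"
    count = max(1, -(-fork_bytes // SEGMENT_SIZE))  # ceiling division
    # The first segment has no suffix; later ones get .1, .2, ...
    return [base if i == 0 else f"{base}.{i}" for i in range(count)]


def locate_block(block_number):
    """Return (segment number, block offset inside that segment)."""
    blocks_per_segment = SEGMENT_SIZE // BLOCK_SIZE  # 131072
    return divmod(block_number, blocks_per_segment)


two_and_a_half_gb = 5 * 512 * 1024 ** 2
print(segment_paths(5, 16384, two_and_a_half_gb))
print(locate_block(200000))
```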

Page, tuple, TOAST, relation files