Question 1

What stages does a query go through from text to result?

Accepted Answer

Four stages. Parse: the text becomes a parse tree; the syntax is checked
and object names are resolved against the system catalog. Rewrite: rules
and views are applied. A reference to a view, for instance, is expanded
into a subquery, and row-level security (RLS) policies are layered on.
Plan: the cost-based optimizer tries access methods and join orders and
picks the cheapest plan. Execute: the tree of plan nodes runs in a
pull-based (Volcano) model: the top node asks its children for rows one at
a time, and they ask theirs in turn. Splitting the work into stages is why
a prepared statement can be parsed and planned once and executed many
times.
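
The pull model of the Execute stage can be illustrated with Python
generators. This is only a toy sketch, not how PostgreSQL's executor is
written: the node names (`seq_scan`, `filter_node`, `limit_node`), the
`pulled` log, and the sample table are all made up for illustration.

```python
from typing import Callable, Iterable, Iterator

Row = dict

def seq_scan(table: Iterable[Row], pulled: list) -> Iterator[Row]:
    """Leaf node: yields rows one at a time and logs every row it hands out."""
    for row in table:
        pulled.append(row["id"])
        yield row

def filter_node(child: Iterator[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    """Pulls from its child and passes on only the rows matching the predicate."""
    for row in child:
        if pred(row):
            yield row

def limit_node(child: Iterator[Row], n: int) -> Iterator[Row]:
    """Stops pulling from its child once n rows have been returned."""
    if n <= 0:
        return
    for i, row in enumerate(child, start=1):
        yield row
        if i >= n:
            return

# Hypothetical 100-row table.
table = [{"id": i, "even": i % 2 == 0} for i in range(1, 101)]
pulled = []

# Plan tree: Limit -> Filter -> Seq Scan. Nothing runs until the top is pulled.
plan = limit_node(filter_node(seq_scan(table, pulled), lambda r: r["even"]), 2)
result = [row["id"] for row in plan]

print(result)  # ids of the two rows returned by the plan
print(pulled)  # ids the scan actually had to read
```

Because every node only produces a row when its parent asks for one, the
scan stops after the fourth row instead of reading all 100.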

Question 2

What is cost in a plan, what is it made of, and is it time?

Accepted Answer

Cost is not time but a dimensionless estimate of work in arbitrary units.
The building blocks are set by parameters: `seq_page_cost` (reading a page
in sequence, the anchor at 1.0), `random_page_cost` (a random read, 4.0 by
default), and the CPU costs `cpu_tuple_cost`, `cpu_index_tuple_cost`,
`cpu_operator_cost`. A node's cost is the number of pages times the page
cost plus the number of rows times the per-row cost. A seq scan, for
example, is roughly `relpages * seq_page_cost + reltuples *
cpu_tuple_cost`. The optimizer compares the total costs of the variants and
takes the smallest. Only the ratios between the parameters matter, so on an
SSD it makes sense to lower `random_page_cost`, since a random read there
costs almost the same as a sequential one.
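
The seq scan formula above can be checked by hand. A minimal sketch,
assuming the default `seq_page_cost` of 1.0 and `cpu_tuple_cost` of 0.01, a
hypothetical table of 10,000 pages and 1,000,000 rows, and no filter
condition (a filter would add `cpu_operator_cost` per row):

```python
SEQ_PAGE_COST = 1.0    # default seq_page_cost
CPU_TUPLE_COST = 0.01  # default cpu_tuple_cost

def seq_scan_cost(relpages, reltuples,
                  seq_page_cost=SEQ_PAGE_COST,
                  cpu_tuple_cost=CPU_TUPLE_COST):
    """Simplified seq scan cost: page reads plus per-row CPU work."""
    return relpages * seq_page_cost + reltuples * cpu_tuple_cost

# Hypothetical table: 10,000 pages, 1,000,000 rows.
cost = seq_scan_cost(relpages=10_000, reltuples=1_000_000)
print(cost)  # page part 10000 plus row part of about 10000
```

Half of the estimate comes from reading pages and half from processing
rows, which is why CPU parameters matter even for I/O-heavy scans.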

Question 3

How does startup cost differ from total cost, and what does LIMIT have to do with it?

Accepted Answer

Each plan node has two estimates: startup cost, the work before the node
returns its first row, and total cost, the work up to the last row. A seq
scan has a near-zero startup: it returns the first row right away. A sort
or a hash aggregate has a high startup: until all input rows are read,
there is no first output row. This matters for `LIMIT`: with `LIMIT 10` the
optimizer looks not at total but at the cost of getting the first rows, and
a plan with a cheap start (for example an index scan in the needed order)
can beat a plan with a cheap total but an expensive start (a seq scan plus
a sort).
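
A rough way to see the effect of `LIMIT` is to interpolate linearly between
startup and total cost. The sketch below does that; the plan numbers are
invented for illustration, and the function name `cost_of_first_rows` is
hypothetical.

```python
def cost_of_first_rows(startup, total, rows, limit):
    """Estimated cost to fetch the first `limit` rows of a node:
    pay the full startup cost, then a proportional share of the rest."""
    fraction = min(limit, rows) / rows if rows > 0 else 1.0
    return startup + (total - startup) * fraction

# Invented numbers for two plans producing the same 1,000,000 ordered rows.
index_scan = dict(startup=0.4, total=400_000.0, rows=1_000_000)      # cheap start
seq_plus_sort = dict(startup=120_000.0, total=125_000.0, rows=1_000_000)  # cheap total

for limit in (10, 1_000_000):
    a = cost_of_first_rows(limit=limit, **index_scan)
    b = cost_of_first_rows(limit=limit, **seq_plus_sort)
    winner = "index scan" if a < b else "seq scan + sort"
    print(limit, round(a, 2), round(b, 2), winner)
```

With `LIMIT 10` the index scan wins by a wide margin; without a limit the
sort-based plan is cheaper, even though nothing about the data changed.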

Question 4

Where does the planner get its row-count estimate? What is selectivity?

Accepted Answer

The planner relies on statistics that `ANALYZE` collects and `pg_statistic`
holds (readable through the `pg_stats` view). For each column these are the
fraction of NULLs (`null_frac`), the number of distinct values
(`n_distinct`), the list of the most common values (MCV) with their
frequencies, a histogram of bucket bounds for the remaining values, and the
correlation of physical order with logical order. Selectivity is the
fraction of rows that will pass a condition, a number from 0 to 1. For
`= 'x'` the planner takes the frequency from the MCV list if the value is
there; otherwise it spreads the remaining non-NULL, non-MCV fraction evenly
over the other distinct values (roughly `1/n_distinct` when there is no MCV
list). For `>`/`<` it is the fraction read off the histogram. A node's
cardinality is selectivity times the input row count. An estimation error
low in the plan propagates upward and ruins the choice of joins.
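
The equality case can be sketched in a few lines. This is a simplified
model, not the planner's actual code: it assumes `n_distinct` is a positive
absolute count (in `pg_stats` it can also be a negative fraction), and the
column statistics below are invented.

```python
def eq_selectivity(value, mcv, null_frac, n_distinct):
    """Simplified selectivity of `col = value`.

    mcv maps each most-common value to its frequency.
    """
    if value in mcv:
        return mcv[value]  # frequency is known directly
    other_distinct = n_distinct - len(mcv)
    if other_distinct <= 0:
        return 0.0  # every distinct value is already in the MCV list
    # Spread the non-NULL, non-MCV fraction evenly over the other values.
    remaining = 1.0 - null_frac - sum(mcv.values())
    return max(remaining, 0.0) / other_distinct

# Invented statistics for a hypothetical `city` column.
mcv = {"Moscow": 0.30, "Kazan": 0.10}
null_frac = 0.05
n_distinct = 57
reltuples = 1_000_000

sel_common = eq_selectivity("Moscow", mcv, null_frac, n_distinct)
sel_rare = eq_selectivity("Omsk", mcv, null_frac, n_distinct)
print(sel_common * reltuples)  # estimated rows for a common value
print(sel_rare * reltuples)    # estimated rows for a value outside the MCV list
```

With no MCV list and no NULLs the function reduces to `1/n_distinct`.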

Question 5

Nested loop, hash join, merge join: when does the planner pick each?

Accepted Answer

Nested loop: for each row of the outer set, find matches in the inner one;
it pays off when there are few outer rows and the inner access is cheap
(usually an index). Hash join: a hash table is built in memory from the
smaller input, and the larger one probes it; good for large sets with no
useful order, but it needs `work_mem` for the hash table (it spills to disk
in batches when the table does not fit) and only works for equality join
conditions. Merge join: both inputs are sorted (or arrive already sorted
from an index) and merged like two ordered lists; it pays off on large sets
when the sort is cheap or the order already exists. Cost makes the choice,
driven by input sizes, the presence of indexes, and available memory.
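
The three algorithms can be compared on in-memory lists. The sketch below
is a teaching model only: rows are tuples whose first element is the join
key, all function names are made up, and nothing here reflects memory
limits or on-disk behavior.

```python
from collections import defaultdict

def nested_loop_join(outer, inner):
    """For each outer row, scan the inner input for matching keys."""
    return [(o, i) for o in outer for i in inner if o[0] == i[0]]

def hash_join(left, right):
    """Build a hash table on the smaller input, probe it with the larger."""
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    table = defaultdict(list)
    for row in build:
        table[row[0]].append(row)
    out = []
    for row in probe:
        for match in table.get(row[0], ()):
            # Keep (left_row, right_row) order regardless of the build side.
            out.append((match, row) if build is left else (row, match))
    return out

def merge_join(left, right):
    """Sort both inputs by key, then advance two cursors in step."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = left[i][0], right[j][0]
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == kl:
                j_end += 1
            # Pair every left row with this key against the whole run.
            while i < len(left) and left[i][0] == kl:
                out.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return out

a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
b = [(2, "x"), (2, "y"), (3, "z"), (4, "w"), (5, "v")]
joined = sorted(nested_loop_join(a, b))
print(len(joined))
```

All three return the same pairs; they differ only in how much work it takes
to find them, which is exactly what the cost estimate tries to predict.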

Question 6

Why does the planner sometimes take a seq scan instead of an index that seems to exist?

Accepted Answer

Index access is cheaper only when the condition is selective, that is, when
it picks a small fraction of rows. Each row found by the index usually
needs a random read of a heap page, and random reads are expensive. When a
condition passes, say, a third of the table, it is cheaper to read the
whole table in sequence (a seq scan) than to make hundreds of thousands of
random heap reads. There is a crossover by row fraction beyond which a seq
scan wins on cost. So on broad conditions the planner deliberately ignores
the index, and it is right to. The crossover is shifted by
`random_page_cost`, by the correlation between index order and physical row
order, and by the availability of an index-only scan.
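
The crossover can be made concrete with a deliberately crude model. The
index cost below assumes the worst case, one random page read per matching
row, and ignores caching, correlation, and the cost of descending the
index; the real cost function is more forgiving. The table size and the
function names are assumptions for illustration.

```python
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01
CPU_INDEX_TUPLE_COST = 0.005

def seq_scan_cost(relpages, reltuples):
    """Read every page sequentially and process every row."""
    return relpages * SEQ_PAGE_COST + reltuples * CPU_TUPLE_COST

def crude_index_scan_cost(reltuples, fraction, random_page_cost=RANDOM_PAGE_COST):
    """Worst case: every matching row costs one random heap page read."""
    matched = reltuples * fraction
    return matched * (random_page_cost + CPU_INDEX_TUPLE_COST + CPU_TUPLE_COST)

def cheaper_scan(relpages, reltuples, fraction, random_page_cost=RANDOM_PAGE_COST):
    seq = seq_scan_cost(relpages, reltuples)
    idx = crude_index_scan_cost(reltuples, fraction, random_page_cost)
    return "index scan" if idx < seq else "seq scan"

# Hypothetical table: 10,000 pages, 1,000,000 rows.
for fraction in (0.001, 0.01, 1 / 3):
    print(fraction, cheaper_scan(10_000, 1_000_000, fraction))

# Lowering random_page_cost (as one might on an SSD) moves the crossover.
print(cheaper_scan(10_000, 1_000_000, 0.01, random_page_cost=1.1))
```

In this model a condition matching 1% of the rows flips from a seq scan to
an index scan once `random_page_cost` drops from 4.0 to 1.1.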

Question 7

How does the planner choose the join order, and what is GEQO?

Accepted Answer

The number of join order variants grows factorially with the number of
tables. Below a threshold the planner enumerates them near-exhaustively
with dynamic programming and finds the cheapest. When there are many tables
(`geqo_threshold` or more FROM items, 12 by default), full enumeration gets
too expensive, and the genetic optimizer GEQO kicks in: it finds a good but
not guaranteed best order in reasonable time. The number of considered
variants is also limited by `from_collapse_limit` and `join_collapse_limit`
(both 8 by default): the first caps how far subqueries are flattened into
the parent query, the second how far explicit JOINs are flattened into a
common pool for enumeration. With `join_collapse_limit = 1`, explicit JOINs
are performed in the order written.
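
To get a feel for the growth, the sketch below counts left-deep join orders
(one simple family of plans, `n!` of them) and models the switch to GEQO.
The function names are hypothetical, and the `>=` comparison follows the
PostgreSQL documentation, which describes `geqo_threshold` as applying to
queries with at least that many FROM items.

```python
import math

GEQO_THRESHOLD = 12  # default geqo_threshold

def left_deep_join_orders(n_tables):
    """Number of left-deep join orders: every permutation of the tables."""
    return math.factorial(n_tables)

def search_strategy(n_from_items, geqo_threshold=GEQO_THRESHOLD, geqo_enabled=True):
    """Which search the planner would use in this simplified model."""
    if geqo_enabled and n_from_items >= geqo_threshold:
        return "geqo"
    return "exhaustive"

for n in (4, 8, 11, 12):
    print(n, left_deep_join_orders(n), search_strategy(n))
```

Going from 8 to 12 tables multiplies the number of left-deep orders by
almost 12,000, which is why a cutoff is needed at all.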

Question 8

When are the ordinary statistics not enough and you need extended ones?

Accepted Answer

Ordinary statistics are collected per column and assume the columns are
independent. When columns correlate, that assumption breaks. The classic
case: `city` and `region`. For `WHERE city='Moscow' AND region='Moscow
Oblast'` the planner multiplies the two selectivities and badly
underestimates the row count, because the conditions nearly duplicate each
other. Extended statistics (`CREATE STATISTICS`) teach the planner the
dependencies: the `dependencies` kind captures functional dependencies,
`ndistinct` the number of distinct combinations, `mcv` the most common
value combinations. Once the next `ANALYZE` has populated them, the
cardinality estimate on correlated columns becomes close to reality.
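
The size of the error is easy to reproduce on synthetic data. In the sketch
below the table is invented: 10 cities with 100 rows each, and every city
belongs to exactly one of 5 regions, so `region` is fully determined by
`city`.

```python
# Synthetic table of (city, region) pairs: city c always lies in region c // 2.
rows = [(f"city{c}", f"region{c // 2}") for c in range(10) for _ in range(100)]
total = len(rows)

# Per-column counts, which is all that ordinary statistics know about.
city_count = sum(1 for city, _ in rows if city == "city3")
region_count = sum(1 for _, region in rows if region == "region1")

# Independence assumption: sel(city) * sel(region) * total.
independent_estimate = city_count * region_count / total

# What the query actually returns.
actual = sum(1 for city, region in rows
             if city == "city3" and region == "region1")

print(independent_estimate, actual)
```

The independence estimate is five times too low here, because the
`region` condition filters out nothing that the `city` condition had not
already excluded.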

Question 9

How do you read EXPLAIN ANALYZE, and what do you look at first?

Accepted Answer

EXPLAIN shows the plan with estimates, and EXPLAIN ANALYZE also runs the
query, adding the actual numbers. The main diagnostic move is comparing the
estimated `rows` with the actual `rows` on each node. A large gap (an order
of magnitude or more) means the planner misjudged the cardinality and the
plan is built on false numbers. Next you look at `loops` (a node on the
inner side of a nested loop runs many times, and its actual rows and time
are per-loop averages, so multiply them by `loops`), `Rows Removed by
Filter` (many rows read and then thrown away, a hint that an index is
missing), and `Buffers` (how many pages were actually touched, reported
when the `BUFFERS` option is on). You hunt the bottleneck from the bottom
up: the first large estimation error or the node with the largest actual
time.
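
The estimate-versus-actual comparison can be automated over the output of
`EXPLAIN (ANALYZE, FORMAT JSON)`. The sketch below walks a plan tree using
the JSON keys `Node Type`, `Plan Rows`, `Actual Rows`, `Actual Loops`, and
`Plans`; the sample plan is a hand-written, hypothetical fragment (the
contents of the `Plan` object), not real output, and `find_misestimates` is
a made-up helper.

```python
def find_misestimates(node, threshold=10.0, path=()):
    """Return nodes whose estimated and actual row counts differ by at
    least `threshold` times, deepest nodes first (bottom-up)."""
    here = path + (node["Node Type"],)
    found = []
    for child in node.get("Plans", []):
        found.extend(find_misestimates(child, threshold, here))
    est = max(node["Plan Rows"], 1)    # avoid division by zero
    act = max(node["Actual Rows"], 1)
    if max(est / act, act / est) >= threshold:
        found.append((" -> ".join(here), node["Plan Rows"],
                      node["Actual Rows"], node["Actual Loops"]))
    return found

# Hypothetical plan fragment for illustration only.
plan = {
    "Node Type": "Nested Loop", "Plan Rows": 10,
    "Actual Rows": 50000, "Actual Loops": 1,
    "Plans": [
        {"Node Type": "Seq Scan", "Plan Rows": 12,
         "Actual Rows": 5000, "Actual Loops": 1},
        {"Node Type": "Index Scan", "Plan Rows": 8,
         "Actual Rows": 10, "Actual Loops": 5000},
    ],
}

for where, est, act, loops in find_misestimates(plan):
    print(f"{where}: estimated {est}, actual {act} per loop, loops={loops}")
```

The first entry printed is the deepest misestimate, the seq scan that was
expected to return 12 rows and returned 5000; the nested loop above it
inherits that error, and the index scan ran 5000 times as a result.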

Planner, cost, statistics, joins