linuxlab.io
Tutorials▾
  • Linux & networking
    File system, processes, TCP/IP, BGP and OSPF
    →
  • Terraform & IaC
    HCL, state, plan/apply on a LocalStack sandbox
    →
  • Git & GitHub
    Object model, plumbing, branching, GitHub Actions
    →
  • PostgreSQL internals
    Page and tuple, MVCC, vacuum, WAL, the planner and indexes
    →
All tutorials →
PricingAboutSign inCreate account
/
Intro
Lessons
Footer
linuxlab-TutorialsPricingAboutPrivacy & cookies
Copyright © 2026 LinuxLab. All rights reserved.
linuxlab.io
Tutorials▾
  • Linux & networking
    File system, processes, TCP/IP, BGP and OSPF
    →
  • Terraform & IaC
    HCL, state, plan/apply on a LocalStack sandbox
    →
  • Git & GitHub
    Object model, plumbing, branching, GitHub Actions
    →
  • PostgreSQL internals
    Page and tuple, MVCC, vacuum, WAL, the planner and indexes
    →
All tutorials →
PricingAboutSign inCreate account
/
  • Introduction
  • Chapters
  • How it works
  • Lessons
  • Knowledge base
  • Interview prep
Cluster

Back to clusters

Planner, cost, statistics, joins

How declarative SQL turns into a plan: parse, rewrite, cost-based planning, execution. Where the cardinality estimates come from, what cost is, when a seq scan beats an index, how the join method and order are chosen. The most hands-on cluster for anyone who reads EXPLAIN.

9 questions · ~40 min read

Questions

On this page

  1. 01What stages does a query go through from text to result?
  2. 02What is cost in a plan, what is it made of, and is it time?
  3. 03How does startup cost differ from total cost, and what does LIMIT have to do with it?
  4. 04Where does the planner get its row-count estimate? What is selectivity?
  5. 05Nested loop, hash join, merge join: when does the planner pick each?
  6. 06Why does the planner sometimes take a seq scan instead of an index that seems to exist?
  7. 07How does the planner choose the join order, and what is GEQO?
  8. 08When are the ordinary statistics not enough and you need extended ones?
  9. 09How do you read EXPLAIN ANALYZE, and what do you look at first?

#query-lifecycle

juniorOften

What stages does a query go through from text to result?

What to say

Four stages. Parse: the text becomes a tree, and the syntax and object names are checked against the system catalog. Rewrite: rules and views are applied. A reference to a view, for instance, is expanded into a subquery, and row security (RLS) is layered on. Plan: the cost-based optimizer tries access methods and join orders and picks the cheapest plan. Execute: the tree of plan nodes runs in a "pull a row from the top" model (volcano), each node asking its children for rows one at a time. Splitting into stages is why a prepared statement can be planned once and executed many times.

What they want to hear

A senior should: - name the four stages in order and what each one does - give an example of rewriting: expanding a view into a subquery, RLS - explain that planning is separate from execution, and tie that to the plan cache of a prepared statement - understand the optimizer is cost-based: it compares estimates rather than running the variants

Pitfalls

  • ✗ Thinking a view is a "stored result". In reality it is expanded into the query text at the rewrite stage
  • ✗ Confusing planning with execution. Planning time and execution time are different numbers in EXPLAIN
  • ✗ Thinking the optimizer tries plans for real. It estimates their cost by formulas

Follow-up

  • ? What happens to a `SELECT` from a view at the rewrite stage?
  • ? Why are planning time and execution time separated in EXPLAIN output?
  • ? How are the stages tied to a prepared statement's plan cache?

Depth in knowledge base

  • The life of a query: five stages
  • Query rewrite: views and RLS
tags: planner, lifecycle, executionbook: postgresql_internals-17.pdf:part4 query execution

#cost-model

intermediateOften

What is cost in a plan, what is it made of, and is it time?

What to say

Cost is not time but a dimensionless estimate of work in conditional units. The base bricks are set by parameters: `seq_page_cost` (reading a page in sequence, the anchor at 1.0), `random_page_cost` (a random read, 4.0 by default), `cpu_tuple_cost`, `cpu_index_tuple_cost`, `cpu_operator_cost`. A node's cost is the number of pages times the page cost plus the number of rows times the per-row cost. A seq scan, for example, is roughly `relpages * seq_page_cost + reltuples * cpu_tuple_cost`. The optimizer compares the total costs of the variants and takes the smallest. So on an SSD it makes sense to lower `random_page_cost`, since a random read there is almost like a sequential one.

What they want to hear

A senior should: - stress that cost is arbitrary units, not milliseconds - name the main cost parameters and their meaning, especially `random_page_cost` versus `seq_page_cost` - reproduce the seq scan formula and understand cost grows with the number of pages and rows - tie tuning `random_page_cost` to the storage type (SSD versus HDD)

Pitfalls

  • ✗ Reading cost as time in milliseconds. It is arbitrary units for comparing plans
  • ✗ Leaving `random_page_cost=4` on an SSD. The optimizer will underrate indexes
  • ✗ Comparing cost between different queries. It only makes sense to compare plans of one query

Follow-up

  • ? Why is `random_page_cost` usually lowered on an SSD?
  • ? What is a Seq Scan node's cost made of?
  • ? Can you compare the cost of two different queries with each other?

Depth in knowledge base

  • The cost model
  • Startup cost vs total cost
  • Access methods: Seq, Index, Bitmap, Index-Only
tags: planner, cost, tuningbook: postgresql_internals-17.pdf:part4 query execution · codelibs.ru_maksimalnaya-proizvoditelnost-arhitekturnye-podhody-k-optimizacii-zaprosov-v-postgresql.pdf:cost estimation

#startup-vs-total-cost

intermediateSometimes

How does startup cost differ from total cost, and what does LIMIT have to do with it?

What to say

Each plan node has two estimates: startup cost, the work before the node returns its first row, and total cost, the work up to the last row. A seq scan has a near-zero startup: it returns the first row right away. A sort or a hash aggregate has a high startup: until all input rows are read, there is no first output row. This matters for `LIMIT`: with `LIMIT 10` the optimizer looks not at total but at the cost of getting the first rows, and a plan with a cheap start (for example an index scan in the needed order) can beat a plan with a cheap total but an expensive start (a seq scan plus a sort).

What they want to hear

A senior should: - define startup as "up to the first row" and total as "up to the last" - give nodes with a high start: sort, hash aggregate, materialize - explain the link to LIMIT: a query with a limit is judged by the cost of the first rows - show why an index in the needed order beats "seq scan plus sort" precisely when there is a LIMIT

Pitfalls

  • ✗ Looking only at total cost when there is a `LIMIT`. The cost of the first rows decides
  • ✗ Thinking every node has a near-zero startup. For sort and hash it is high
  • ✗ Being surprised the planner picks different plans with and without LIMIT

Follow-up

  • ? Why does `ORDER BY ... LIMIT 10` sometimes pick an index, while without LIMIT it picks seq scan plus sort?
  • ? Which nodes have a high startup cost?
  • ? How does LIMIT change a plan's estimate?

Depth in knowledge base

  • Startup cost vs total cost
  • The cost model
tags: planner, cost, limitbook: postgresql_internals-17.pdf:part4 query execution

#statistics-selectivity

intermediateOften

Where does the planner get its row-count estimate? What is selectivity?

What to say

The planner relies on statistics that `ANALYZE` collects and `pg_statistic` holds (readable through `pg_stats`). For a column that is the fraction of NULLs, the number of distinct values (`n_distinct`), the list of the most common values (MCV) with their frequencies, a histogram of bounds for the rest, plus the correlation of physical order with logical order. Selectivity is the fraction of rows that will pass a condition, a number from 0 to 1. For `= 'x'` it takes the frequency from MCV or `1/n_distinct`; for `>`/`<` it is the fraction by the histogram. A node's cardinality is selectivity times the row count. An estimation error low in the plan propagates upward and ruins the choice of joins.

What they want to hear

A senior should: - list what statistics hold: null_frac, n_distinct, MCV, histogram, correlation - explain how selectivity is computed from this for equality and for a range - say that cardinality is selectivity times the row count, and that errors grow from the bottom up - give the levers: `ANALYZE`, `default_statistics_target`, and understand that stale statistics wreck plans

Pitfalls

  • ✗ Forgetting `ANALYZE` after a bulk load. The planner counts on empty or stale statistics
  • ✗ Confusing selectivity with a row count. It is a fraction from 0 to 1, and the cardinality comes from multiplying
  • ✗ Thinking the planner has statistics on a join result. It does not, which is why the error grows up the plan

Follow-up

  • ? How do you estimate the selectivity of `col > 100` from `pg_stats`?
  • ? Why does a cardinality error amplify on the upper plan nodes?
  • ? When does raising `default_statistics_target` help?

Depth in knowledge base

  • Planner statistics: pg_stats
  • Selectivity and the crossover point
  • The cost model
tags: planner, statistics, selectivitybook: postgresql_internals-17.pdf:part4 query execution · codelibs.ru_postgresql-query-optimization-the-ultimate-guide-to-building-efficient-queries.pdf:statistics

#join-algorithms

intermediateOften

Nested loop, hash join, merge join: when does the planner pick each?

What to say

Nested loop: for each row of the outer set, find matches in the inner one; it pays off when there are few outer rows and the inner access is cheap (usually an index). Hash join: a hash table is built in memory from the smaller table, and the larger one probes it; good for large sets with no useful order, but it needs `work_mem` for the hash and only works on equality. Merge join: both inputs are sorted (or arrive already sorted by an index) and merged like two ordered lists; it pays off on large sets when the sort is cheap or the order already exists. Cost makes the choice: sizes, the presence of indexes, available memory.

What they want to hear

A senior should: - describe the mechanics of all three and the typical niche of each - tie hash join to `work_mem`: out of memory means the hash spills to disk in several passes - explain that a nested loop with an index is about a small outer side, while a merge join is about already-sorted large inputs - understand the choice is not rigid: the optimizer compares cost given the statistics and memory

Pitfalls

  • ✗ Calling a nested loop always bad. On a small outer set with an index it is the best
  • ✗ Forgetting `work_mem` for a hash join. Short on memory, the hash goes to disk and gets much more expensive
  • ✗ Thinking a merge join is free when there is an ORDER BY. Sorting the inputs costs too

Follow-up

  • ? What happens to a hash join if the hash table does not fit in `work_mem`?
  • ? When does a nested loop turn out to be the best choice?
  • ? Why is a merge join cheap when the inputs are already sorted?

Depth in knowledge base

  • Join algorithms: nested loop, hash, merge
  • The cost model
tags: planner, joins, work-membook: postgresql_internals-17.pdf:part4 query execution

#seqscan-index-crossover

intermediateOften

Why does the planner sometimes take a seq scan instead of an index that seems to exist?

What to say

Index access is cheaper only at low selectivity, when the condition picks a small fraction of rows. Each row found by the index usually needs a random read of a heap page, and random reads are expensive. When a condition passes, say, a third of the table, it is cheaper to read it all in sequence (a seq scan) than to make hundreds of thousands of random index lookups. There is a crossover by row fraction beyond which a seq scan wins on cost. So on broad conditions the planner deliberately ignores the index, and it is right to. The crossover is shifted by `random_page_cost`, data correlation, and the availability of an index-only scan.

What they want to hear

A senior should: - explain the cause: index access pays for random heap reads, and at a high row fraction that is more expensive than a sequential read - name the selectivity crossover and the factors that shift it (random_page_cost, correlation, the visibility map) - understand that choosing a seq scan on a broad condition is normal, not a bug - tie it to diagnostics: an inflated `random_page_cost` on an SSD makes it avoid indexes for nothing

Pitfalls

  • ✗ Thinking the presence of an index forces the planner to use it. At high selectivity a seq scan is cheaper
  • ✗ Forcing an index through `enable_seqscan=off` in production instead of analyzing the cost
  • ✗ Ignoring correlation: with poor correlation an index scan makes even more random reads

Follow-up

  • ? What does the row fraction at which a seq scan beats an index depend on?
  • ? How does `random_page_cost` shift the crossover?
  • ? Why does good data correlation make an index scan cheaper?

Depth in knowledge base

  • Selectivity and the crossover point
  • Access methods: Seq, Index, Bitmap, Index-Only
  • The cost model
tags: planner, seqscan, selectivitybook: codelibs.ru_maksimalnaya-proizvoditelnost-arhitekturnye-podhody-k-optimizacii-zaprosov-v-postgresql.pdf:scan methods

#join-order-geqo

seniorSometimes

How does the planner choose the join order, and what is GEQO?

What to say

The number of join order variants grows factorially with the number of tables. Up to a small threshold the planner exhaustively enumerates them with dynamic programming and finds the optimum. When there are many tables (more than `geqo_threshold`, 12 by default), full enumeration gets too expensive, and the genetic optimizer GEQO kicks in: it finds a good but not guaranteed best order in reasonable time. The number of considered variants is affected by `join_collapse_limit` and `from_collapse_limit`: they set how deeply subqueries and explicit JOINs are unfolded into a common pool for enumeration. An explicit JOIN order at a low limit is fixed as written.

What they want to hear

A senior should: - explain the combinatorial explosion and the switch from exact enumeration to GEQO by `geqo_threshold` - understand the role of `join_collapse_limit`/`from_collapse_limit`: they decide what enters the common pool for order enumeration - know the side effect: at a low `join_collapse_limit` the explicit JOIN order is fixed and can become a manual control - see the GEQO trade-off: less planning time at the cost of possible non-optimality

Pitfalls

  • ✗ Thinking the planner always finds the optimal order. Under GEQO it is a heuristic
  • ✗ Not knowing about `join_collapse_limit` and being surprised that reordering JOINs changes the plan
  • ✗ Blindly raising `geqo_threshold` on queries with dozens of tables. Planning time soars

Follow-up

  • ? What does lowering `join_collapse_limit` to 1 change?
  • ? At how many tables does GEQO kick in by default?
  • ? What does GEQO trade for planning speed?

Depth in knowledge base

  • Join order, GEQO, and the plan cache
  • Join algorithms: nested loop, hash, merge
tags: planner, join-order, geqobook: postgresql_internals-17.pdf:part4 query execution

#extended-statistics

seniorSometimes

When are the ordinary statistics not enough and you need extended ones?

What to say

Ordinary statistics are collected per column and assume the columns are independent. When columns correlate, that assumption breaks. The classic case: `city` and `region`. For `WHERE city='Moscow' AND region='Moscow Oblast'` the planner estimates it as the product of two selectivities and badly underrates the row count, because these conditions nearly duplicate each other. Extended statistics (`CREATE STATISTICS`) teach the planner the dependencies: the `dependencies` type catches functional dependencies, `ndistinct` the number of distinct combinations, `mcv` the common value combinations. After it, the cardinality estimate on correlated columns becomes close to reality.

What they want to hear

A senior should: - explain the root: column independence in ordinary statistics and the underestimate on correlated predicates - give a clear correlated pair and show how the error leads to a bad plan - name the kinds of extended statistics: dependencies, ndistinct, mcv, and what each fixes - understand the statistics must be created explicitly and re-collected with `ANALYZE`. It will not appear on its own

Pitfalls

  • ✗ Thinking the planner accounts for column relationships on its own. By default it treats them as independent
  • ✗ Creating `CREATE STATISTICS` and forgetting `ANALYZE`. The estimates will not update
  • ✗ Slapping extended statistics on every column pair. It costs to collect and is only needed where there is correlation and pain

Follow-up

  • ? Why is `WHERE city=... AND region=...` underrated without extended statistics?
  • ? How does the `dependencies` type differ from `mcv` in `CREATE STATISTICS`?
  • ? What do you have to do after creating extended statistics for it to take effect?

Depth in knowledge base

  • Extended statistics (CREATE STATISTICS)
  • Planner statistics: pg_stats
  • Selectivity and the crossover point
tags: planner, statistics, correlationbook: postgresql_internals-17.pdf:part4 query execution · codelibs.ru_postgresql-query-optimization-the-ultimate-guide-to-building-efficient-queries.pdf:extended statistics

#explain-estimated-actual

intermediateOften

How do you read EXPLAIN ANALYZE, and what do you look at first?

What to say

EXPLAIN shows the plan with estimates, and EXPLAIN ANALYZE also runs the query, adding the actual numbers. The main diagnostic move is comparing the estimated `rows` with the `actual rows` on each node. A large gap (an order of magnitude or more) means the planner missed the cardinality and the plan is built on false numbers. Next you look at `loops` (a node in a nested loop runs many times), `Rows Removed by Filter` (read a lot, throw a lot away, an index is asking for it), and `Buffers` (how many pages were actually read). You hunt the bottleneck from the bottom up: the first large estimation error or the most expensive node by fact.

What they want to hear

A senior should: - name the "estimated versus actual rows" move as the first step of the analysis - read `loops`, `Rows Removed by Filter`, `Buffers` and understand what each signals - look for the root cause from the bottom up rather than grabbing the top expensive node - distinguish EXPLAIN (estimates only) from EXPLAIN ANALYZE (a real run with all consequences, including a write on DML)

Pitfalls

  • ✗ Running `EXPLAIN ANALYZE` on an INSERT/UPDATE/DELETE without a transaction to roll back. It really performs the change
  • ✗ Looking only at the top node's total time and not comparing estimates with fact
  • ✗ Ignoring `loops`: an expensive node inside a nested loop is multiplied by the number of iterations

Follow-up

  • ? What does a tenfold gap between `rows` and `actual rows` mean?
  • ? How do you safely run `EXPLAIN ANALYZE` for an `UPDATE`?
  • ? What does a large `Rows Removed by Filter` tell you?

Depth in knowledge base

  • The life of a query: five stages
  • Planner statistics: pg_stats
  • The cost model
tags: planner, explain, diagnosticsbook: codelibs.ru_maksimalnaya-proizvoditelnost-arhitekturnye-podhody-k-optimizacii-zaprosov-v-postgresql.pdf:explain analyze
Footer
linuxlab-TutorialsPricingAboutPrivacy & cookies
Copyright © 2026 LinuxLab. All rights reserved.