Back to clusters

Operations, observability, anti-patterns

What separates someone who has read about PostgreSQL from someone who has run it: diagnostics through pg_stat, finding slow queries, connection pooling, memory tuning, backup and PITR, bloat and the usual anti-patterns, basic engine security. The cluster about how not to take production down and how to understand fast what is wrong with it.

8 questions · ~35 min read

#pg-stat-views

intermediateOften

Which system views do you start a live database's diagnostics from?

What to say

First, `pg_stat_activity`: who is connected now, what they run, for how long, in what state (active, idle, idle in transaction), and what they wait on (`wait_event`). It immediately shows stuck transactions and locks. `pg_stat_statements` (an extension) aggregates over normalized queries: total time, call count, average, buffer reads; the main tool for "which queries eat the server". `pg_locks` shows who blocks whom. `pg_stat_user_tables` and `pg_stat_user_indexes` show scans, dead versions, the last autovacuum, index usage. `pg_stat_io` (PG 16) gives a picture of reads and writes by type. This is the set any incident analysis starts from.

What they want to hear

A senior should: - name `pg_stat_activity` as the starting point and what to look at in it (state, wait_event, duration) - know `pg_stat_statements` as the way to find heavy queries by aggregates - list `pg_locks`, `pg_stat_user_tables/indexes`, `pg_stat_io` and what each answers - move from symptom to view: generally slow means statements, one thing hangs means activity plus locks

Pitfalls

  • Starting diagnostics by guessing from the logs instead of `pg_stat_activity` and `pg_stat_statements`
  • Not enabling `pg_stat_statements` ahead of time and being left without history at the moment of an incident
  • Looking only at the average query time and missing a rare but catastrophically expensive one

Follow-up

  • ? How do you find a transaction stuck in idle in transaction through `pg_stat_activity`?
  • ? What does `pg_stat_statements` give that the slow query log does not?
  • ? Where do you look to tell whether an index is used?

Depth in knowledge base

tags: ops, monitoring, pg-statbook: codelibs.ru_monitoring-postgresql.pdf:pg_stat views

#find-slow-queries

intermediateOften

How do you find and analyze a slow query in production?

What to say

First find the culprit, then analyze it. You find it through `pg_stat_statements`: sort by total time (`total_exec_time`) or by average and look at the top. In parallel you turn on the slow query log (`log_min_duration_statement`) to catch specific executions with their parameters, and `auto_explain` to put the plan of long queries straight into the log. Having found the query, you run `EXPLAIN (ANALYZE, BUFFERS)` and compare estimates with fact, looking for where the cardinality blows up, whether there is a seq scan where an index is asked for, and whether a sort spills to disk. The rule: measure first (`pg_stat_statements`, EXPLAIN), then change, not the other way around.

What they want to hear

A senior should: - split the steps: find the heavy query (statements/log) and analyze it (EXPLAIN ANALYZE BUFFERS) - name `log_min_duration_statement` and `auto_explain` as ways to catch specific executions - read the plan by the "estimate versus fact" move and look for seq scan, a sort to disk, a growing cardinality - hold to "measure first, change second" instead of guessing at indexes

Pitfalls

  • Throwing indexes at it on a hunch without finding the real culprit in `pg_stat_statements`
  • Looking only at the average time and missing a query with rare but huge spikes
  • Changing the config at random instead of analyzing a specific plan

Follow-up

  • ? How does `pg_stat_statements` complement `log_min_duration_statement`?
  • ? What does `auto_explain` show, and when do you turn it on?
  • ? What in `EXPLAIN (ANALYZE, BUFFERS)` points to a shortage of `work_mem`?

Depth in knowledge base

tags: ops, slow-queries, explainbook: codelibs.ru_postgresql-query-optimization-the-ultimate-guide-to-building-efficient-queries.pdf:finding slow queries · codelibs.ru_monitoring-postgresql.pdf:slow queries

#connection-pooling

intermediateOften

Why do you need a connection pool, and why is "just add more connections" bad?

What to say

Each connection in PostgreSQL is a separate OS process with its own memory. A thousand connections is a thousand processes: they compete for the CPU, context switching eats time, and the combined `work_mem` can devour all the memory, because it is counted per operation, not per server. So more connections almost always means slower, not faster. The solution is a pool: pgbouncer keeps a small number of real connections to the database and multiplexes many client ones onto them. Transaction pooling hands a server connection out for the duration of a transaction and returns it to the pool, so hundreds of clients work through dozens of real connections. A sensible ceiling on real connections is usually around the number of cores times a small factor.

What they want to hear

A senior should: - explain that a connection is a process, and why a growing connection count hits the CPU and memory - tie the danger to `work_mem`: it is per operation, and many connections multiply the consumption - describe pgbouncer and transaction pooling as multiplexing onto a small number of real connections - give a guide for the number of real connections and warn about the incompatibilities of transaction pooling (prepared statements, advisory locks)

Pitfalls

  • Raising `max_connections` into the thousands instead of using a pool. The server chokes on context switching and memory
  • Forgetting that `work_mem` is multiplied by the number of operations and connections, and hitting OOM
  • Using session features (prepared statements, advisory locks) in transaction pooling without regard for the mode

Follow-up

  • ? Why is `work_mem` dangerously multiplied with a large number of connections?
  • ? How does transaction pooling differ from session pooling?
  • ? What is a rough guide for the number of real connections to the database?

Depth in knowledge base

tags: ops, pooling, pgbouncerbook: codelibs.ru_postgresql-mistakes-and-how-to-avoid-them.pdf:connection management · codelibs.ru_mastering-postgresql-15-advanced-techniques-to-build-and-manage-scalable-reliable-and-fault-tolerant-database-applications-5-ed.pdf:connection pooling

#memory-config

intermediateOften

shared_buffers, work_mem, maintenance_work_mem: how do you think about them?

What to say

`shared_buffers` is the shared buffer cache for the whole server, a sensible start around a quarter of RAM; the rest is left to the OS cache, because PostgreSQL relies on it too. `work_mem` is memory for one sort or hash operation in a query, not per query and not per server: a complex query with several sorts and parallelism can take several `work_mem` at once, and hundreds of connections multiply that many times over, so you keep it moderate and raise it pointwise. `maintenance_work_mem` is memory for maintenance (vacuum, `CREATE INDEX`), and it can be set generously, because few such operations run at the same time. The key trap is thinking `work_mem` is allocated once per server.

What they want to hear

A senior should: - distinguish the scopes: shared_buffers (the whole server), work_mem (per operation), maintenance_work_mem (per maintenance) - explain the danger of `work_mem`: it multiplies by operations, parallelism, and the connection count - give starting guides and say why you do not hand all memory to shared_buffers (the OS cache, the risk of heavy checkpoints) - advise raising `work_mem` locally for heavy analytical queries rather than globally

Pitfalls

  • Treating `work_mem` as a global limit. It is per operation, and the total easily drives to OOM
  • Handing nearly all memory to `shared_buffers`. The OS cache is needed too
  • Setting a large `work_mem` globally for the sake of a couple of analytical queries and risking memory on an OLTP load

Follow-up

  • ? Why can't you crank `work_mem` globally on a loaded server?
  • ? Why leave memory to the OS cache instead of giving it all to `shared_buffers`?
  • ? How does `maintenance_work_mem` differ from `work_mem` in risk?

Depth in knowledge base

tags: ops, memory, tuningbook: codelibs.ru_postgresql-10-administration-cookbook-over-165-effective-recipes-for-database-management-and-maintenance-in-postgresql-10-4-ed.pdf:memory configuration

#backup-pitr

intermediateOften

Logical dump versus a physical backup with PITR: when do you use each?

What to say

A logical backup (`pg_dump`/`pg_dumpall`) exports the data as a set of commands or an archive: portable across versions and platforms, handy for a single database or table, but slow to restore on large volumes and gives a snapshot only as of the dump moment. A physical backup (`pg_basebackup` or a directory copy) plus continuous WAL archiving gives PITR (point-in-time recovery): you can restore the cluster to any moment between the base backup and the end of the archive, for example a second before an accidental `DELETE`. For large production the base is a physical backup plus a WAL archive; a logical dump is a supplement for portability and selective restore. Both have to be checked regularly with a trial restore.

What they want to hear

A senior should: - contrast logical (portable, as of the dump moment, slow restore) with physical plus WAL (fast restore, PITR) - explain PITR: a base backup plus a WAL archive give restore to a precise point in time - tie it to the previous point: it is PITR, not a replica, that saves you from a logical mistake - stress checking backups: an unverified backup is not a backup

Pitfalls

  • Relying only on `pg_dump` for large production. The restore takes hours and loses everything after the dump
  • Keeping backups and never trying a restore. In an incident they turn out to be broken
  • Treating the WAL archive as optional. Without it a physical backup gives only the base copy moment, no PITR

Follow-up

  • ? What besides a base backup do you need to get PITR?
  • ? Why is `pg_dump` a poor sole backup for a terabyte database?
  • ? How do you verify a backup works without waiting for a disaster?

Depth in knowledge base

tags: ops, backup, pitrbook: codelibs.ru_postgresql-10-administration-cookbook-over-165-effective-recipes-for-database-management-and-maintenance-in-postgresql-10-4-ed.pdf:backup and recovery

#top-anti-patterns

intermediateOften

Name the common PostgreSQL operations anti-patterns and why each is harmful.

What to say

Disabling autovacuum "so it does not get in the way" is a direct path to bloat and a wraparound emergency. Holding long and idle in transaction transactions stalls the horizon and piles up garbage. Slapping an index on every column: each one slows writes and eats space, and the planner often does not use them. Running `ALTER TABLE` on a hot table without `lock_timeout`: the lock queue jams solid. Inflating the connection count instead of using a pool. Storing huge values with no regard for TOAST and UPDATE load. Doing `SELECT *` and pulling TOAST where it is not needed. Not monitoring the age of transactions and replication slots. Each item is a typical cause of a real incident, not theory.

What they want to hear

A senior should: - list the core anti-patterns: disabled autovacuum, long transactions, index overload, DDL without lock_timeout, refusing a pool - for each, name the concrete harm, not just "bad" - tie them to the mechanisms already covered (horizon, locks, TOAST, wraparound) - show that prevention is monitoring (transaction age, slots, bloat), not reaction after the fact

Pitfalls

  • Naming anti-patterns without explaining the mechanism of harm. That is exactly what they check
  • Treating "an index on everything" as optimization. On writes it is a pure tax
  • Confusing the symptom (slow) with the cause (bloat, a long transaction, a bad plan)

Follow-up

  • ? Why does "an index on every column" hurt rather than help?
  • ? What exactly is dangerous about disabling autovacuum?
  • ? Which metrics do you monitor to head off these problems?

Depth in knowledge base

tags: ops, anti-patterns, reliabilitybook: codelibs.ru_postgresql-mistakes-and-how-to-avoid-them.pdf:common mistakes

#table-bloat

seniorOften

What is table bloat, how do you detect it, and how do you remove it?

What to say

Bloat is space taken by dead row versions and gaps in pages that no longer carries useful data. It grows when dead versions appear faster than vacuum removes them: a heavy UPDATE/DELETE load, a lagging autovacuum, a horizon held by long transactions. The symptoms: the table and indexes grow while the live row count does not; an index-only scan degrades into Heap Fetches. You detect it through `pg_stat_user_tables` (`n_dead_tup`), the `pgstattuple` extension for a precise estimate, and estimating queries over the catalog. You cure it in escalating order: fix autovacuum and remove long transactions (prevention), and for an already bloated object use `VACUUM FULL`, `CLUSTER`, or `pg_repack`/`REINDEX CONCURRENTLY` for indexes.

What they want to hear

A senior should: - define bloat as dead versions plus gaps and name the causes: load, a lagging vacuum, a held horizon - name the diagnostics: `n_dead_tup`, `pgstattuple`, growing size with an unchanged row count - distinguish prevention (autovacuum, short transactions) from the cure (VACUUM FULL, pg_repack, REINDEX) - understand that without removing the cause, bloat returns after any rebuild

Pitfalls

  • Fixing bloat with `VACUUM FULL` in circles without removing the cause (long transactions, a weak autovacuum)
  • Measuring bloat by file size without `pgstattuple` and over- or underestimating it
  • Forgetting index bloat. It is cured by `REINDEX`, not by a plain table vacuum

Follow-up

  • ? How do you tell table growth from growth of the bloat itself?
  • ? Why is `VACUUM FULL` without removing the cause a temporary measure?
  • ? What cures index bloat rather than table bloat?

Depth in knowledge base

tags: ops, bloat, vacuumbook: codelibs.ru_postgresql-mistakes-and-how-to-avoid-them.pdf:bloat · codelibs.ru_monitoring-postgresql.pdf:table bloat

#engine-security

intermediateSometimes

What goes into the basic security of the PostgreSQL engine?

What to say

Several layers. Authentication, `pg_hba.conf`: who connects, from where, and by which method; you set `scram-sha-256` instead of the outdated md5 and close `trust` in production. Authorization, roles and privileges by least privilege: the application must not run as a superuser, it has its own role with grants only on the objects it needs, and the `public` schema is not open for writes to everyone. Transport, TLS for connections over the network. Data protection, separating the schema owner role from the application's working role, a careful `SET ROLE`, revoking excess `GRANT`. Plus hygiene: no passwords in the code, restrict the network at the firewall level, keep the server behind a perimeter rather than on a public address.

What they want to hear

A senior should: - lay it out in layers: authentication (`pg_hba`, scram), authorization (roles, least privilege), transport (TLS), network (firewall, perimeter) - name the specifics: remove `trust`, do not work as a superuser, close writes to `public` - understand the principle of least privilege for the application's role - tie it to operations: security is not one flag but a set of defaults that are easy to forget

Pitfalls

  • Leaving `trust` or `md5` in `pg_hba.conf` in production instead of `scram-sha-256`
  • Running the application as a superuser instead of a role with minimal rights
  • Keeping the server on a public address with no firewall, relying on the password alone

Follow-up

  • ? Why is `scram-sha-256` better than `md5` in `pg_hba.conf`?
  • ? Why is writing to the `public` schema for everyone a problem?
  • ? What does the principle of least privilege look like for an application role?

Depth in knowledge base

tags: ops, security, rolesbook: codelibs.ru_postgresql-10-administration-cookbook-over-165-effective-recipes-for-database-management-and-maintenance-in-postgresql-10-4-ed.pdf:security