Question 1

Which system views do you start a live database's diagnostics from?

Accepted Answer

First, `pg_stat_activity`: who is connected now, what they run, for how
long, in what state (active, idle, idle in transaction), and what they wait
on (`wait_event_type`, `wait_event`). It immediately shows stuck
transactions and sessions waiting on locks.
`pg_stat_statements` (an extension) aggregates over normalized queries:
total time, call count, average, buffer reads; it is the main tool for
"which queries eat the server". `pg_locks` lists held and awaited locks,
and `pg_blocking_pids()` turns that into who blocks whom.
`pg_stat_user_tables` and `pg_stat_user_indexes` show scans, dead tuples,
the last autovacuum, and index usage. `pg_stat_io` (PostgreSQL 16 and later)
breaks reads and writes down by backend type, object, and context. This is
the set any incident analysis starts from.
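
The first pass over `pg_stat_activity` can be sketched as a simple filter. This is a minimal illustration, not a monitoring tool: the `Session` record, the `find_suspects` helper, the five-minute threshold, and the sample rows are all assumptions made for the example. Only the column names in the query string come from the view itself.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Query one might run to fetch the rows; shown for context, not executed here.
ACTIVITY_SQL = """
SELECT pid, state, wait_event_type, wait_event,
       now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE state <> 'idle'
"""


@dataclass(frozen=True)
class Session:
    """Hypothetical in-memory copy of one pg_stat_activity row."""
    pid: int
    state: str
    wait_event_type: Optional[str]
    xact_age: Optional[timedelta]


def find_suspects(sessions, max_xact_age=timedelta(minutes=5)):
    """Return (pid, reason) pairs for sessions worth a closer look."""
    suspects = []
    for s in sessions:
        too_old = s.xact_age is not None and s.xact_age > max_xact_age
        if s.state == "idle in transaction" and too_old:
            suspects.append((s.pid, "idle in transaction too long"))
        elif s.wait_event_type == "Lock":
            suspects.append((s.pid, "waiting on a lock"))
    return suspects


# Made-up rows, purely for illustration.
rows = [
    Session(101, "active", None, timedelta(seconds=2)),
    Session(102, "idle in transaction", "Client", timedelta(minutes=40)),
    Session(103, "active", "Lock", timedelta(seconds=30)),
]
suspects = find_suspects(rows)
```

In practice the threshold depends on the workload; the point is only that state, transaction age, and wait event type together narrow the search quickly.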

Question 2

How do you find and analyze a slow query in production?

Accepted Answer

First find the culprit, then analyze it. You find it through
`pg_stat_statements`: sort by total time (`total_exec_time`) or by average
(`mean_exec_time`) and look at the top. In parallel you turn on the slow
query log (`log_min_duration_statement`) to catch specific executions with
their parameters, and `auto_explain` to put the plan of long queries
straight into the log. Having found the query, you run
`EXPLAIN (ANALYZE, BUFFERS)`, keeping in mind that `ANALYZE` really executes
the statement, and compare estimated row counts with actual ones, looking
for where the cardinality estimate blows up, whether there is a seq scan
where an index scan was expected, and whether a sort spills to disk. The
rule: measure first (`pg_stat_statements`, `EXPLAIN`), then change, not the
other way around.
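
Comparing estimates with actual row counts can be mechanized over the output of `EXPLAIN (ANALYZE, FORMAT JSON)`. The sketch below walks a plan tree and flags nodes whose estimate is off by a chosen factor. The helper names, the 10x threshold, and the sample plan are assumptions for illustration; the keys `Node Type`, `Plan Rows`, `Actual Rows`, and `Plans` are the ones the JSON format uses.

```python
from typing import Iterator, List, Tuple


def walk_plan(node: dict) -> Iterator[dict]:
    """Yield a plan node and all of its descendants, depth first."""
    yield node
    for child in node.get("Plans", []):
        yield from walk_plan(child)


def misestimates(plan: dict, factor: float = 10.0) -> List[Tuple[str, float, float]]:
    """Return (node type, estimated rows, actual rows) where they diverge.

    Both numbers are per loop in EXPLAIN output, so they are comparable.
    Values are clamped to 1 to avoid division by zero on empty results.
    """
    found = []
    for node in walk_plan(plan):
        estimated = max(float(node["Plan Rows"]), 1.0)
        actual = max(float(node["Actual Rows"]), 1.0)
        if max(estimated / actual, actual / estimated) >= factor:
            found.append((node["Node Type"], estimated, actual))
    return found


# Hand-written plan fragment, not taken from a real server.
SAMPLE_PLAN = {
    "Node Type": "Nested Loop", "Plan Rows": 10, "Actual Rows": 50000,
    "Plans": [
        {"Node Type": "Seq Scan", "Plan Rows": 10, "Actual Rows": 50000},
        {"Node Type": "Index Scan", "Plan Rows": 1, "Actual Rows": 1},
    ],
}
flagged = misestimates(SAMPLE_PLAN)
```

The deepest flagged node is usually where to start, since a bad estimate at the bottom of the tree propagates to every node above it.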

Question 3

Why do you need a connection pool, and why is "just add more connections" bad?

Accepted Answer

Each connection in PostgreSQL is a separate OS process with its own memory.
A thousand connections is a thousand processes: they compete for the CPU,
context switching eats time, and the combined `work_mem` can devour all the
memory, because the limit applies per operation, not per server. So past a
certain point, more connections means slower, not faster. The solution is a
pool: PgBouncer keeps a small number of real connections to the database
and multiplexes many client ones onto them. Transaction pooling hands a
server connection out for the duration of a transaction and then returns it
to the pool, so hundreds of clients work through dozens of real
connections. A sensible ceiling on real connections is usually around the
number of cores times a small factor.
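
The sizing rule of thumb reduces to simple arithmetic. The sketch below is only that: the function names and the default factor of 3 are placeholders chosen for the example, not recommended values, and the right factor has to come from measuring the actual workload.

```python
def pool_ceiling(cpu_cores: int, factor: float = 3.0) -> int:
    """Rough cap on real server connections: cores times a small factor."""
    if cpu_cores < 1 or factor <= 0:
        raise ValueError("cpu_cores must be >= 1 and factor must be > 0")
    return max(1, round(cpu_cores * factor))


def clients_per_server_connection(clients: int, server_connections: int) -> float:
    """How many client connections share one real connection on average."""
    if server_connections < 1:
        raise ValueError("server_connections must be >= 1")
    return clients / server_connections


# Hypothetical 16-core server fronted by a pool that serves 600 clients.
ceiling = pool_ceiling(cpu_cores=16)                  # 16 * 3.0 = 48
ratio = clients_per_server_connection(600, ceiling)   # 600 / 48 = 12.5
```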

Question 4

shared_buffers, work_mem, maintenance_work_mem: how do you think about them?

Accepted Answer

`shared_buffers` is the shared buffer cache for the whole server; a
sensible start is around a quarter of RAM, with the rest left to the OS
cache, because PostgreSQL relies on it too. `work_mem` is the memory limit
for one sort or hash operation in a query, not per query and not per
server: a complex query with several sorts and parallel workers can take
several times `work_mem` at once, and hundreds of connections multiply that
many times over, so you keep it moderate and raise it selectively, for a
session or a single heavy query. `maintenance_work_mem` is memory for
maintenance (vacuum, `CREATE INDEX`), and it can be set generously, because
few such operations run at the same time. The key trap is treating
`work_mem` as a cap on the whole server or on a whole query.
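
The multiplication behind the `work_mem` trap is worth doing once by hand. The sketch below is a deliberately pessimistic upper bound under a simplified model (every connection runs a query, every operation in it uses the full limit, every parallel worker does the same). The function and its parameter names are assumptions for the example, not anything PostgreSQL exposes.

```python
MB = 1024 * 1024


def worst_case_work_mem_bytes(
    active_connections: int,
    work_mem_bytes: int,
    ops_per_query: int = 1,
    workers_per_query: int = 0,
) -> int:
    """Upper bound on memory claimed through work_mem at one moment.

    Each sort or hash operation may use up to work_mem, and each parallel
    worker runs its own copy of the operation alongside the leader process.
    """
    processes_per_query = 1 + workers_per_query
    return active_connections * ops_per_query * processes_per_query * work_mem_bytes


# Hypothetical load: 200 busy connections, work_mem = 64 MB,
# three sort/hash operations per query, two parallel workers each.
total = worst_case_work_mem_bytes(200, 64 * MB, ops_per_query=3, workers_per_query=2)
total_gib = total / (1024 * MB)   # 200 * 3 * 3 * 64 MB = 115200 MB = 112.5 GiB
```

Real usage is normally far below this bound, but the bound shows why a value that looks modest per operation cannot be reasoned about as a server-wide setting.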

Question 5

Logical dump versus a physical backup with PITR: when do you use each?

Accepted Answer

A logical backup (`pg_dump`/`pg_dumpall`) exports the data as a set of
commands or an archive: it is portable across versions and platforms and
handy for a single database or table, but it is slow to restore on large
volumes and captures only the state as of the dump. A physical backup
(`pg_basebackup` or a file-level copy of the data directory) plus continuous
WAL archiving gives PITR (point-in-time recovery): you can restore the
cluster to any moment between the end of the base backup and the end of the
archive, for example a second before an accidental `DELETE`. For large
production the base is a physical backup plus a WAL archive; a logical dump
is a supplement for portability and selective restore. Both have to be
checked regularly with a trial restore.
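
The recovery window can be expressed as a one-line check. This is a simplified sketch: the function name and all timestamps are invented for the example, and a real restore also depends on every WAL segment in the range actually being present in the archive.

```python
from datetime import datetime, timedelta, timezone


def can_recover_to(target: datetime, base_backup_end: datetime, archive_end: datetime) -> bool:
    """True if the target lies between the end of the base backup and the
    last moment covered by archived WAL."""
    return base_backup_end <= target <= archive_end


# Made-up timeline for illustration.
base_backup_end = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
archive_end = datetime(2024, 1, 3, 12, 0, tzinfo=timezone.utc)
accidental_delete = datetime(2024, 1, 2, 9, 15, 30, tzinfo=timezone.utc)

# Aim one second before the accidental DELETE.
target = accidental_delete - timedelta(seconds=1)
recoverable = can_recover_to(target, base_backup_end, archive_end)

# A moment before the base backup finished cannot be reached from this backup.
too_early = can_recover_to(
    base_backup_end - timedelta(hours=1), base_backup_end, archive_end
)
```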

Question 6

Name the common PostgreSQL operations anti-patterns and why each is harmful.

Accepted Answer

Disabling autovacuum "so it does not get in the way" is a direct path to
bloat and a wraparound emergency. Holding long-running or idle in
transaction sessions holds back the xmin horizon, so vacuum cannot remove
dead rows and garbage piles up. Slapping an index on every column: each one
slows writes and eats space, and the planner often does not use them.
Running `ALTER TABLE` on a hot table without `lock_timeout`: while it waits
for its lock, every later query queues behind it and the lock queue jams
solid. Inflating the connection count instead of using a pool: more
processes, more context switching, more memory. Storing huge values with no
regard for TOAST and UPDATE load. Doing `SELECT *` and pulling TOAST where
it is not needed. Not monitoring the age of transactions and replication
slots: a forgotten slot retains WAL until the disk fills. Each item is a
typical cause of a real incident, not theory.

Question 7

What is table bloat, how do you detect it, and how do you remove it?

Accepted Answer

Bloat is space taken by dead row versions and gaps in pages that no longer
carries useful data. It grows when dead versions appear faster than vacuum
removes them: a heavy UPDATE/DELETE load, a lagging autovacuum, a horizon
held by long transactions. The symptoms: the table and indexes grow while
the live row count does not, and index-only scans report a growing
`Heap Fetches` count. You detect it through `pg_stat_user_tables`
(`n_dead_tup`, itself an estimate), the `pgstattuple` extension for a precise
measurement, and estimating queries over the catalog. You cure it in
escalating order: first fix autovacuum and remove long transactions
(prevention); for an already bloated table use `VACUUM FULL` or `CLUSTER`,
which rewrite it under an exclusive lock, or `pg_repack` to rebuild it
online; for indexes use `REINDEX CONCURRENTLY`.
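
Two small calculations make the detection side concrete. The sketch assumes the stock autovacuum defaults (`autovacuum_vacuum_threshold = 50`, `autovacuum_vacuum_scale_factor = 0.2`); the function names and the row counts are invented for the example, and the dead-tuple ratio is only a rough signal, since it says nothing about free space already left inside pages.

```python
def dead_tuple_ratio(n_live_tup: int, n_dead_tup: int) -> float:
    """Share of dead tuples among all tuples, from pg_stat_user_tables counters."""
    total = n_live_tup + n_dead_tup
    return n_dead_tup / total if total else 0.0


def autovacuum_trigger(reltuples: float, threshold: int = 50, scale_factor: float = 0.2) -> float:
    """Dead tuples needed before autovacuum vacuums the table:
    threshold + scale_factor * number of rows."""
    return threshold + scale_factor * reltuples


# Hypothetical tables of very different sizes.
small = autovacuum_trigger(1_000)          # 50 + 0.2 * 1000, about 250
large = autovacuum_trigger(100_000_000)    # about 20 million dead tuples
ratio = dead_tuple_ratio(n_live_tup=900, n_dead_tup=100)
```

With the default scale factor a large table accumulates tens of millions of dead tuples before autovacuum starts, which is why per-table settings are a common first step in "fix autovacuum".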

Question 8

What goes into the basic security of the PostgreSQL engine?

Accepted Answer

Several layers. Authentication via `pg_hba.conf`: who connects, from where,
and by which method; you set `scram-sha-256` instead of the outdated `md5`
and do not allow `trust` in production. Authorization via roles and
privileges, following least privilege: the application must not run as a
superuser, it has its own role with grants only on the objects it needs,
and the `public` schema is not open for writes to everyone. Transport: TLS
for connections over the network. Data protection: separating the schema
owner role from the application's working role, careful use of `SET ROLE`,
revoking excess grants. Plus hygiene: no passwords in the code, restrict
the network at the firewall level, keep the server behind a perimeter
rather than on a public address.

Operations, observability, anti-patterns