State, backend, lock, drift

State is the most common topic in a Terraform interview, and it is what sets Terraform apart from `kubectl apply`. The questions below cover what state holds, why you need a lock, how to catch drift, and when state surgery is acceptable. They come from real DevOps interviews at AWS teams, banks, and mid-size infrastructure teams.

7 questions · ~28 min read

#what-is-state-and-why

junior · often

Why does Terraform need state? Why not just read the provider's API?

What to answer

State is the map between HCL and the real provider resources. It holds resource ids, computed attributes, and metadata. Without state, Terraform does not know which real bucket `aws_s3_bucket.demo` in HCL refers to, because names in HCL are addresses, not global identifiers. Rediscovering everything through the API on every run is expensive and sometimes impossible (data lag, eventual consistency). It also would not cover the "removed from HCL, so destroy the resource" case.

What they want to hear

A senior should:
- explain addressing: `aws_s3_bucket.demo` is an address in the graph, while the real id (say the bucket name `my-bucket-12345`, along with its ARN `arn:aws:s3:::my-bucket-12345`) lives in state
- name what state is for: mapping HCL to real ids, caching computed attributes, and knowing before-and-after for the plan diff
- say that without state Terraform would have to rediscover every resource through the API on each plan, which does not scale to tens of thousands of resources
- mention `terraform refresh` as a forced re-read from the API. It is deprecated in favor of `terraform apply -refresh-only` and is rarely needed, since plan and apply refresh on their own
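The address-versus-id distinction can be illustrated with a minimal sketch. The bucket name and the output name are hypothetical, chosen only for illustration.

```hcl
# "aws_s3_bucket.demo" is only an address inside this configuration.
# The real bucket is identified by its name, which Terraform records in state.
resource "aws_s3_bucket" "demo" {
  # Hypothetical bucket name, used purely for illustration.
  bucket = "my-bucket-12345"
}

# Computed attributes such as the ARN are never written in HCL.
# The provider returns them after creation and Terraform stores them in state.
output "demo_bucket_arn" {
  value = aws_s3_bucket.demo.arn
}
```

After an apply, `terraform state show aws_s3_bucket.demo` prints the id and the computed attributes that state holds for this address.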

Pitfalls

✗ Saying state is a 'config backup.' The config is the HCL; state is the mapping
✗ Thinking state can be fully rebuilt from the API. Partly yes, through import, but values that exist only in state, such as the result of a `random_id` resource, cannot be re-read from any API
✗ Not mentioning that computed attributes live in state, which leaves it unclear how passwords and other sensitive values end up there

Follow-up

? What happens if you delete the state file? Which data can you recover?
? How does `terraform refresh` differ from `terraform plan -refresh-only`?
? Why are computed attributes in state if they were never in the HCL?


#remote-backend-why-and-lock

intermediate · often

Why a remote backend, and what is a state lock? What happens without one?

What to answer

Local state does not work for a team: two developers apply from out-of-sync copies and overwrite each other's state. A remote backend (S3+DynamoDB, GCS, Azure Blob, Terraform Cloud) solves two problems: shared storage and locking. A lock is an exclusive mutex held for the length of plan/apply. Without a lock, two applies at once create a race: both read state, both apply, the last one to write wins, and some resources drop out of state and become orphans in the provider.

What they want to hear

A senior should:
- separate storage (S3) from locking (DynamoDB in the classic AWS stack). The S3 backend has traditionally had no lock of its own, so locking needs a separate mechanism (recent Terraform releases also offer S3-native locking through `use_lockfile`)
- note that the lock is taken at the start of plan/apply and held to the end. `-lock-timeout=10m` saves you in CI when another run still holds the lock, because Terraform waits instead of failing at once
- cover `terraform force-unlock <id>` and when it is safe (only when you are sure the process that held the lock is dead)
- mention that Terraform Cloud / TFE folds locking and storage into one service, with an audit log on top
- name the risk: state in S3 with no encryption means passwords in the clear. At a minimum use SSE; better, KMS plus an IAM bucket policy
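A minimal sketch of the S3 plus DynamoDB setup described here. The bucket, table, key path, region, and KMS key ARN are all hypothetical placeholders.

```hcl
terraform {
  backend "s3" {
    # Storage: the state file itself lives in S3 (hypothetical names).
    bucket = "example-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "eu-west-1"

    # Locking: DynamoDB holds only the lock record, never the state.
    # The table needs a string partition key named "LockID".
    dynamodb_table = "example-terraform-locks"

    # Encryption at rest: SSE with a customer-managed KMS key (placeholder ARN).
    encrypt    = true
    kms_key_id = "arn:aws:kms:eu-west-1:111122223333:key/00000000-0000-0000-0000-000000000000"
  }
}
```

With this backend in place, a CI job would typically run `terraform apply -lock-timeout=10m` so that it waits for a held lock rather than failing immediately.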

Pitfalls

✗ Saying DynamoDB stores state. No, it holds only the lock; the state lives in S3
✗ Running force-unlock 'just in case' when an apply looks stuck. Often the apply is alive, and force-unlock lets it write to state with no lock
✗ Turning on S3 versioning and forgetting the lifecycle rules. Every apply adds a new version, so storage cost keeps growing

Follow-up

? What does `terraform force-unlock` do under the hood?
? How is a lock in S3+DynamoDB different from a lock in Terraform Cloud?
? Can you use S3 alone, without DynamoDB, for solo development?


#terraform-import-when-and-pitfalls

intermediate · often

What is `terraform import` for, and where does it hurt?

What to answer

Import adopts an already existing resource into state without creating it again. You use it when a resource was made by hand (or by another tool) and you want to start managing it with Terraform. The pain: import creates the state entry, but you write the HCL yourself, by eye. If the HCL does not match reality, the next plan shows a diff and tries to "fix" the resource to match the HCL. On complex resources (security groups, IAM policies) reproducing the structure by hand takes hours. Version 1.5 added the `import {}` block, which is declarative and can be committed to the repo.

What they want to hear

A senior should:
- separate the CLI `terraform import` (mutates state, no planning) from the `import {}` block (goes through plan, reviewable in a PR)
- name the classic workflow: `import`, then `terraform plan`, then keep filling in HCL until the plan becomes a no-op
- mention `terraform plan -generate-config-out=imported.tf`, an HCL generator from a real resource in 1.5+. The HCL quality is so-so (no locals, no for_each), but it gives you a skeleton
- note that importing into a child module requires the full address, for example `module.foo.aws_s3_bucket.demo`; the bare resource address will not work
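A minimal sketch of the declarative form, assuming Terraform 1.5 or later. The bucket name is hypothetical, and the `module.foo` address in the comment only illustrates the addressing rule.

```hcl
# Declarative import: reviewed in a PR and executed as part of plan/apply.
import {
  to = aws_s3_bucket.demo
  # For aws_s3_bucket the import id is the bucket name (hypothetical here).
  id = "my-bucket-12345"
}

# The matching resource block. Keep adjusting it until
# `terraform plan` reports the import and no other changes.
resource "aws_s3_bucket" "demo" {
  bucket = "my-bucket-12345"
}

# For a resource that lives in a child module, "to" takes the full address:
#   to = module.foo.aws_s3_bucket.demo
```

If the resource block is not written yet, `terraform plan -generate-config-out=imported.tf` can produce a first draft of it from the `import` block alone.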

Pitfalls

✗ Thinking import 'moves' the resource. It only binds state to the existing one; the resource itself is untouched
✗ Importing into state with only a stub resource block and never finishing the HCL. If the block is later dropped, Terraform sees 'the resource is in state but not in HCL' and plans to destroy it
✗ Importing many resources in one scripted pass. One failure and state is inconsistent. Import one at a time and commit after each

Follow-up

? How is the `import {}` block fundamentally better than the CLI command?
? What does `-generate-config-out` do, and why can't you commit its output as is?
? How do you import a resource into a child module? What address do you write?


#drift-detection-how-to-catch

intermediate · often

What is drift, and how do you catch it in production?

What to answer

Drift is the gap between what state records and what actually exists in the provider. Someone opened the AWS Console and changed a tag by hand, a security group was updated by another tool, an IAM role was rolled back during incident response. To catch it, run `terraform plan -refresh-only` (state versus reality) or plain `terraform plan` (HCL versus reality). In CI, run a cron job with `plan -detailed-exitcode`, where exit code 2 means there are changes, and send an alert to Slack. Then decide case by case: accept the drift by updating the HCL to match (and syncing state with `apply -refresh-only`), or revert it with a normal apply that pushes the resource back to what the HCL declares.

What they want to hear

A senior should:
- name `-detailed-exitcode`: 0 = no changes, 2 = changes, 1 = error. That is the building block for cron detection
- separate "good drift" (someone fixed an incident faster than IaC could) from "bad drift" (someone broke the GitOps process). Both need an alert, but the response differs
- mention that drift in cloud-init / user_data / Lambda code is usually fine: the real config lives elsewhere, and IaC only bootstraps it
- name `ignore_changes` as the standard way to keep noisy attributes like `tags["LastModified"]` out of drift detection
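A minimal sketch of `ignore_changes` for two typical noise sources: capacity that the autoscaler adjusts on its own, and a tag that an external tool rewrites. The resource names, variables, and the `LastModified` tag are hypothetical.

```hcl
# Hypothetical inputs, declared only to keep the sketch self-contained.
variable "private_subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "web" {
  name                = "web"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies change this at runtime; do not report it as drift.
    ignore_changes = [desired_capacity]
  }
}

resource "aws_s3_bucket" "assets" {
  bucket = "example-assets-bucket"

  tags = {
    Team = "platform"
  }

  lifecycle {
    # Ignore a single map key instead of the whole tags attribute.
    ignore_changes = [tags["LastModified"]]
  }
}
```

Ignoring one key rather than all of `tags` keeps the remaining tags under drift detection.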

Pitfalls

✗ Running `plan` in CI without `-detailed-exitcode` and parsing stdout with a regex. It breaks on every Terraform upgrade
✗ Treating every drift as an incident. On large infrastructure there is constant noise from cloud-managed attributes (auto-scaling adjustments)
✗ Fixing drift with a manual `apply` every time. You gradually move away from the PR process and lose review

Follow-up

? What does exit code 2 from `terraform plan` mean, and why is it useful?
? How do you set up `ignore_changes` so it does not react to ASG capacity?
? Which is better for drift detection: a cron job in CI or Terraform Cloud drift detection?

Deeper in the knowledge base

terraform plan: see what Terraform is about to do
lifecycle: controlling resource behavior

#taint-vs-replace

intermediate · sometimes

How does `terraform taint` differ from `-replace`? Why is taint deprecated?

What to answer

`terraform taint` marked a resource in state as "needs recreation," and the next apply rebuilt it. The problem: taint mutated state right away, with no planning. If a colleague had already run plan and was about to apply, their plan was stale. Terraform 0.15.2 added `terraform apply -replace=<address>`: same behavior, but through plan, so it shows in the diff, gets discussed in a PR, and passes CI review. Taint is kept for backward compatibility, but new projects should not use it.

What they want to hear

The candidate should:
- say that `-replace` is the right modern path and taint is legacy
- explain the semantics: the resource is destroyed and created again at the same address, and state records the new object (with a new id when the provider generates one)
- mention the `create_before_destroy` lifecycle and how it changes the replace order for zero downtime
- name the case: replace helps when the HCL has not changed but the resource is corrupted (a DB instance in FAILED state, an unhealthy EC2, a config that drifted)
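A minimal sketch of how `create_before_destroy` changes the order of a replacement. The instance settings and the `ami_id` variable are hypothetical.

```hcl
# Hypothetical input, declared only to keep the sketch self-contained.
variable "ami_id" {
  type = string
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  lifecycle {
    # On replacement, create the new instance first and destroy the old
    # one afterwards, instead of the default destroy-then-create order.
    create_before_destroy = true
  }
}
```

A forced replacement of this resource would then be requested with `terraform apply -replace="aws_instance.app"`, and the plan shows the resource marked for replacement before anything is changed.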

Pitfalls

✗ Using taint in a pipeline. Every run mutates state, and colleagues hit surprise changes
✗ Saying replace always changes the id. The object is always recreated, but only provider-generated ids change (a new EC2 instance id); an id derived from a name you chose, such as an S3 bucket name, stays the same
✗ Applying replace to a resource with `prevent_destroy = true` and being surprised the plan fails. The lifecycle guard wins

Follow-up

? What happens to dependent resources during a `-replace`?
? Can you replace a data source? Why not?
? In which scenarios is `-replace` better than `destroy` plus `apply`?


#state-secrets-and-risks

senior · sometimes

Sensitive data in state: what's wrong with it, and how do you protect it?

What to answer

State writes every attribute to plain JSON, including passwords, keys, and secrets. `sensitive = true` on an output only hides the value in CLI output; in state it is still stored as is. Protection happens at the backend level: S3 with KMS encryption, an IAM bucket policy that follows least privilege, an access log on the bucket. The main rule: do not put plain secrets in HCL. Read them from Vault/SSM/Secrets Manager through a data source, so the secret at least stays out of git, even though it still ends up in state.

What they want to hear

A senior should:
- say plainly that `sensitive = true` is only a UI filter, not encryption. State is not the place for secrets, but they end up there anyway
- name KMS plus IAM plus access logs as the minimum layered protection
- mention that Terraform Cloud / TFE encrypts state at rest automatically, with an audit log of who read it
- name the approach for credential rotation: rotate in the secrets store, then apply updates or recreates the resource with the new value. The current state then holds only the new value; with bucket versioning on, the old one survives in old state versions, so do not forget the lifecycle rules
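A minimal sketch of the "secret through a data source" pattern, here with AWS Secrets Manager. The secret name and the database settings are hypothetical.

```hcl
# The secret value never appears in HCL or in git (hypothetical secret name).
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/db/password"
}

resource "aws_db_instance" "main" {
  identifier          = "example-db"
  engine              = "postgres"
  instance_class      = "db.t3.micro"
  allocated_storage   = 20
  username            = "app"
  password            = data.aws_secretsmanager_secret_version.db.secret_string
  skip_final_snapshot = true
}

# "sensitive" only masks the value in plan and apply output.
# The plain value is still written to the state file.
output "db_password" {
  value     = aws_db_instance.main.password
  sensitive = true
}
```

The password still lands in state through both the data source and the resource, which is why backend encryption and access control remain necessary.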

Pitfalls

✗ Saying `sensitive = true` encrypts. No, it only masks
✗ Putting a key in .tfvars and committing it. State will hold it, and so will the git history
✗ Not setting a bucket policy and leaving state publicly readable through S3. That is a class of leak seen in several real incidents

Follow-up

? What does `terraform output -raw <name>` do with a sensitive value?
? Why is the 'secret through a data source' approach better than `sensitive` on a variable?
? How do you set up an IAM bucket policy for the state bucket with least privilege for CI?


#state-surgery-mv-rm-when-ok

senior · sometimes

`terraform state mv` and `rm`: when is it fine, and when is it dangerous?

What to answer

State surgery comes up during refactoring: you renamed a resource in HCL, pulled it into a module, or split a big root module into several. Without `state mv` Terraform sees "the old one is gone, the new one is not in state" and wants to destroy and create. The danger: `state rm` detaches a resource from state, but it stays in the provider. Forget about it and you have an orphan. Since 1.1 the `moved {}` block replaces `state mv` in most cases: it is declarative, reviewed in a PR, and does not mutate state before apply.

What they want to hear

A senior should:
- say that the `moved {}` block is preferable to `state mv` for renames and moves between modules, following the "through plan, not through mutation" logic
- name when `state rm` is justified: the resource was handed to another team or another root module, and its state entry already exists there. Without `rm` you get two owners of one resource
- mention `terraform state replace-provider`, a rare but needed operation when the provider source changes (registry.terraform.io to registry.opentofu.org)
- note that after any surgery the first `plan` must be a no-op. If it shows a diff, your `mv` was off, so undo it and think again
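A minimal sketch of the declarative alternatives to state surgery, assuming Terraform 1.1 or later for `moved` and 1.7 or later for `removed`. All resource names are hypothetical.

```hcl
# Rename: the bucket used to live at aws_s3_bucket.demo.
# Terraform updates the address in state during apply; nothing is recreated.
moved {
  from = aws_s3_bucket.demo
  to   = aws_s3_bucket.assets
}

resource "aws_s3_bucket" "assets" {
  bucket = "my-bucket-12345"
}

# Moving into a module uses the full module address in "to", for example:
#   to = module.storage.aws_s3_bucket.assets

# Declarative counterpart of `state rm`: forget the resource
# without destroying it in the provider.
removed {
  from = aws_s3_bucket.legacy

  lifecycle {
    destroy = false
  }
}
```

In both cases the change appears in `terraform plan` before state is touched, so it can be reviewed like any other change.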

Pitfalls

✗ Running `state rm` and then forgetting the resource in the provider. You get an orphan that no one pays attention to, but AWS keeps charging for it
✗ Using `state mv` instead of `moved {}` in a team setup. State changes immediately, so any plan a colleague has already produced goes stale
✗ Doing surgery with no state backup. `terraform state pull > state.backup.json` before any `mv`/`rm` is the rule

Follow-up

? Why is the `moved {}` block fundamentally safer than `terraform state mv`?
? When do you need `state rm`, where the `removed {}` block won't replace it?
? How do you undo a botched `state mv`? What goes into the .backup?

