Refactoring and large state

What to do when a monolithic root has grown to 3000 resources and plan takes seven minutes. The moved/removed blocks, splitting state across workspaces and separate roots, zero-downtime migration. Senior questions from real migrations at banks and large product teams.

5 questions · ~26 min read

#moved-block-vs-state-mv

intermediate · sometimes

What is the `moved {}` block for, and why is it better than `terraform state mv`?

What to answer

`moved {}` declares a resource move in HCL: "what used to be `module.old.x` is now `module.new.x`." At plan, Terraform sees the block and moves the state entry as part of the change set, without destroy and create. The key difference from `state mv` is that it is declarative. The block is committed to the repo, reviewed in a PR, and visible in a colleague's `terraform plan`. Once every environment has run `apply`, you can drop the block a few releases later. `state mv` mutates state imperatively from one person's terminal, shows up nowhere in the PR, and has to be repeated by hand for every other environment, which is a source of drift between people.
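
A minimal sketch of the pattern, assuming a bucket that used to be declared in the root as `aws_s3_bucket.logs` and now lives in a child module. The resource names and the module path are hypothetical.

```hcl
# The resource block for aws_s3_bucket.logs has been moved into this module.
module "storage" {
  source = "./modules/storage" # assumed local module path
}

# Tells Terraform that the existing state entry belongs to the new address,
# so the plan shows a move instead of destroy and create.
moved {
  from = aws_s3_bucket.logs
  to   = module.storage.aws_s3_bucket.logs
}
```

With the block in place, `terraform plan` should report that the resource has moved and propose no other change for it.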

What they want to hear

A senior should:
- note that the moved block arrived in 1.1 as the standard refactoring mechanism
- explain the grace period: the block stays in HCL for a few releases until every environment applies it, then it is removed
- say that moved works only for addresses: rename a resource, move it into a child module, change a prefix. It does not work for changing the resource type or the provider (the exception is a cross-type move that the provider explicitly implements, possible since Terraform 1.8)
- mention that one block describes one resource move; a mass refactor takes several blocks, and the review needs care

Pitfalls

✗ Using `state mv` in a team. The state you touched knows about the move, but other environments and colleagues' checkouts do not, and on their apply they see destroy and create
✗ Removing the moved block before every environment has applied it. The environments that have not applied yet get destroy and create
✗ Thinking moved can change a resource type (say, `aws_s3_bucket` to a hypothetical `aws_s3_bucket_v2`). In general it cannot; unless the provider explicitly supports that cross-type move, a type change needs import plus rm

Follow-up

? How many releases should you keep a moved block in the HCL?
? How do you rename a resource across providers (`aws` to `awscc`)?
? What happens if two moved blocks contradict each other?

Deeper in the knowledge base

#removed-block

intermediate · rarely

What is the `removed {}` block, and when do you need it?

What to answer

`removed {}` (since 1.7) is a declarative way to take a resource out of Terraform's management without destroying it in the provider. It is the alternative to `terraform state rm`, which mutates state imperatively and outside the PR. You delete the `resource` block, write a `removed` block with the resource address in `from` and `lifecycle { destroy = false }`, and on the next apply Terraform drops the entry from state while the resource stays alive in AWS. It is handy when you hand a resource to another team or another root module: they import it on their side, you remove it on yours, and within one PR cycle both sides can see it.
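
A minimal sketch of the giving-up side, assuming the resource was declared as `aws_s3_bucket.shared_assets` (a hypothetical name) and its `resource` block has already been deleted from the configuration.

```hcl
# Drops aws_s3_bucket.shared_assets from this root's state on the next apply.
removed {
  from = aws_s3_bucket.shared_assets

  lifecycle {
    # false: forget the resource but leave the real bucket untouched.
    # true:  plan a normal destroy of the bucket as well.
    destroy = false
  }
}
```

The plan for this change should say that the resource will no longer be managed by Terraform but will not be destroyed.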

What they want to hear

A senior should:
- note that the removed block is the counterpart of the moved block but outward: the resource leaves management instead of moving
- separate the two variants: `destroy = false` (keep the resource in the provider) and `destroy = true` (delete it from both state and the provider, which is just a planned destroy)
- explain the handover case: team A writes removed, team B writes import, in the same time window
- mention that removed takes only an address, not a filter or pattern; a mass rm takes several blocks
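
The receiving side of that handover can be sketched with an `import` block (available since Terraform 1.5) in team B's root. The bucket name and resource address below are hypothetical; for `aws_s3_bucket` the import ID is the bucket name.

```hcl
# Team B's root: adopt the bucket that team A is releasing.
import {
  to = aws_s3_bucket.shared_assets
  id = "example-shared-assets" # hypothetical bucket name
}

# The configuration has to match the real bucket, otherwise the first
# plan after the import will propose changes to it.
resource "aws_s3_bucket" "shared_assets" {
  bucket = "example-shared-assets"
}
```

Applying team B's import before, or together with, team A's removal keeps the bucket from ever being unmanaged.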

Pitfalls

✗ Using `state rm` instead of `removed {}` in a shared project. Colleagues do not see what you did
✗ Writing `removed { destroy = false }` for a resource another root needs, but forgetting to import it in that other root. The resource in the provider becomes an orphan
✗ Thinking the removed block has to stay around for a while. It is deleted in the same release once applied

Follow-up

? When is `removed { destroy = false }` better than `terraform state rm`?
? How do you coordinate removed plus import between two root modules?
? Can you use removed to undo an accidental import?

Deeper in the knowledge base

#monolithic-state-splitting

senior · sometimes

A monolithic state of 3000 resources, plan takes seven minutes. How do you split it?

What to answer

First measure where it stalls: refresh, the graph, the provider. Most of the time usually goes to refreshing each resource through the provider's API. The splitting strategy: pull the stable layer (VPC, IAM, DNS) apart from the frequently changing one (applications). Each piece is a separate root with its own state. The links between them go through a `terraform_remote_state` data source or through SSM Parameter Store. Do it in steps: a new empty root, then import resources from the monolith, then a removed block in the monolith, then check that `terraform plan` is clean on both sides, then the next piece.
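
A sketch of the two ways to link the roots, assuming an S3 backend and a network root that exposes an output named `private_subnet_ids`. The bucket, key, region, and parameter path are all hypothetical.

```hcl
# Application root, option 1: read the network root's outputs directly.
data "terraform_remote_state" "network" {
  backend = "s3"

  config = {
    bucket = "example-tf-state"          # hypothetical state bucket
    key    = "network/terraform.tfstate" # hypothetical state key
    region = "eu-west-1"
  }
}

# Application root, option 2: read a value the network root published to
# SSM Parameter Store, which avoids granting read access to its state.
data "aws_ssm_parameter" "vpc_id" {
  name = "/infra/network/vpc_id" # hypothetical parameter path
}

locals {
  private_subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
  vpc_id             = data.aws_ssm_parameter.vpc_id.value
}
```

Option 1 only sees values the network root declares as root-level `output` blocks. Option 2 needs the network root to write the parameter, for example with an `aws_ssm_parameter` resource.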

What they want to hear

A senior should:
- name measuring before splitting: `time terraform plan`, `TF_LOG=trace` to see where refresh hangs
- separate the lifecycle layers: networking changes once a quarter, deployment can run once a day. These should not share state
- say that `-parallelism` (10 by default) can be raised to speed up refresh; on large infrastructure it helps, but it is no cure-all
- mention `terraform stacks` (Cloud-only, beta in 2025-26) as the coming standard for multi-state coordination
- note that the split should be gradual: not everything at once, but one piece, then the next

Pitfalls

✗ Splitting state along 'neat-looking' lines (by resource type) instead of by lifecycle layer. The links between states multiply, and the remote_state links become a bottleneck
✗ Using `-parallelism=50` as a 'fix.' The provider's API starts rate-limiting requests, the opposite of what you wanted
✗ Doing the split through `terraform state mv` in a team. See the moved/removed point; same problem, just at a bigger scale

Follow-up

? What is `terraform stacks`, and how does it differ from splitting roots by hand?
? How do you measure where exactly plan spends its time?
? When does `-parallelism` help, and when does it hurt?

Deeper in the knowledge base

[[tf-large-scale-state]]
[[tf-stacks]]
Refactoring patterns: count to for_each, split files, extract module
state mv, state rm, state pull/push: manual operations

#zero-downtime-resource-migration

senior · sometimes

How do you migrate a resource between types with no downtime? Example: an ALB onto a new VPC.

What to answer

The canonical pattern is blue/green. (1) Create green: a new ALB in the new VPC alongside the old one. (2) Connect both to one DNS record through weighted routing (Route53 weight 0 on green, 100 on blue). (3) Shift the weight gradually: 10/90, 50/50, 90/10. (4) Once green holds all the traffic and is stable, destroy blue. In Terraform this is two versions of HCL in one root (or two roots), with the DNS weights driven by a variable. Without weighted routing it is cruder, through `create_before_destroy` on the ALB and recreating the DNS records.
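
A sketch of the weighted-routing step, assuming both load balancers already exist as `aws_lb.blue` and `aws_lb.green` in this root. The variable names, the record name, and those resource addresses are hypothetical.

```hcl
variable "zone_id" {
  type        = string
  description = "Route 53 hosted zone ID (assumed to be passed in)."
}

variable "green_weight" {
  type        = number
  default     = 0 # start with all traffic on blue
  description = "Share of traffic sent to green, 0 to 100."

  validation {
    condition     = var.green_weight >= 0 && var.green_weight <= 100
    error_message = "green_weight must be between 0 and 100."
  }
}

resource "aws_route53_record" "app_blue" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "blue"

  weighted_routing_policy {
    weight = 100 - var.green_weight
  }

  alias {
    name                   = aws_lb.blue.dns_name
    zone_id                = aws_lb.blue.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "app_green" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "green"

  weighted_routing_policy {
    weight = var.green_weight
  }

  alias {
    name                   = aws_lb.green.dns_name
    zone_id                = aws_lb.green.zone_id
    evaluate_target_health = true
  }
}
```

Each shift (10, 50, 90, 100) is then a one-line change to `green_weight` in its own PR, which keeps every step small and easy to revert.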

What they want to hear

A senior should:
- name blue/green as a pattern, not a Terraform feature. It is an architectural approach
- explain why `create_before_destroy` on its own does not give zero downtime for everything: the nuances are in the dependencies (DNS records, target groups)
- mention the need for separate states for blue and green in large migrations, so blue can be removed without risk to green
- say that an RDS migration needs a different approach: a read replica and promote, or the blue/green deployments built into RDS
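
One of those dependency nuances, sketched with a target group: the replacement can be created first only if its name does not collide with the old one, which is why this example uses `name_prefix` rather than a fixed `name`. The resource name, port, and `var.vpc_id` are assumptions.

```hcl
variable "vpc_id" {
  type = string
}

resource "aws_lb_target_group" "app" {
  # A generated suffix lets old and new target groups coexist during
  # replacement. A fixed name would fail on create with a name conflict.
  name_prefix = "app-"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id

  lifecycle {
    create_before_destroy = true
  }
}
```

Even with this in place, traffic moves only when the listener or DNS record that points at the target group is updated, so the cutover itself still has to be planned.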

Pitfalls

✗ Doing blue/green in one PR with every change at once. The blast radius is huge, and a `revert` cannot undo it in any reasonable time
✗ Not setting a health check on green before shifting traffic. You can take the product down
✗ Forgetting to destroy blue after a successful switch. You pay double for infrastructure for months

Follow-up

? How does blue/green differ from canary?
? How do you do blue/green for RDS, where the data lives in one instance?
? What is the minimum set of observability you need to shift traffic safely?

Deeper in the knowledge base

[[tf-blue-green-migration]]
lifecycle: controlling resource behavior
Refactoring patterns: count to for_each, split files, extract module

#plan-slow-on-large-infra

senior · sometimes

`terraform plan` takes 7 minutes on 5000 resources. What do you do?

What to answer

Start with diagnosis: `TF_LOG=debug terraform plan` shows where the time goes. Usually there are three causes. (1) The refresh of each resource through the API, which you address with `-refresh=false` for quick checks or by splitting state into pieces. (2) A large graph, where thousands of edges get recomputed, which you address by refactoring the HCL structure and the for_each usage. (3) A heavy provider, where the kubernetes provider, for example, pulls API discovery every time. The radical fix is splitting state into roots by lifecycle layer.

What they want to hear

A senior should:
- name `-refresh=false` for the daily PR plans (fast ones), plus a periodic full plan for drift detection
- say that `-target` should not be used as a "speed-up." It distorts the graph and is dangerous
- mention `-parallelism N` (10 by default), to be raised carefully since providers rate-limit
- note that splitting state by lifecycle is the most durable fix, and the rest are palliatives
- mention that Terraform stacks (2025-26) solves this problem natively

Pitfalls

✗ Using `-target` in a pipeline to speed things up. The first drift in an untargeted resource goes unnoticed
✗ Turning refresh off globally. It is bad for drift detection, and the trade-off should be a deliberate one
✗ Setting `-parallelism=50` with no backoff in the provider. You get throttling and retries, and the run ends up slower than with the default of 10

Follow-up

? Why is `-refresh=false` risky as a long-term strategy?
? Why is `-target` a smell rather than a fix?
? How does `terraform stacks` solve the slow plan problem?

Deeper in the knowledge base

[[tf-large-scale-state]]
terraform plan: see what Terraform is about to do
[[tf-stacks]]
